reflected text analysis: computational linguistics meets...

67
Reflected Text Analysis: Computational Linguistics Meets Digital Humanities Nils Reiter Sarah Schulz

Upload: dangcong

Post on 17-Aug-2019

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Reflected Text Analysis:Computational LinguisticsMeetsDigital Humanities

Nils ReiterSarah Schulz

Page 2: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

About this Talk

Computational Linguistics and Digital HumanitiesDH in StuttgartCoreference Resolution

If you should have questions, feel free to ask…

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 2

Page 3: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

ComputationalLinguisticsand DigitalHumanities

1

Page 4: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Corpus Linguistics

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 5: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Corpus Linguistics investigates language with thehelp of corpora

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 6: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Corpus Linguistics investigates language with thehelp of corpora

ComputationalLinguistics

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 7: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Corpus Linguistics investigates language with thehelp of corpora

ComputationalLinguistics

DH

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 8: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Corpus Linguistics investigates language with thehelp of corpora

ComputationalLinguistics

DH

investigates techniques for theprocessing of natural language

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 9: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL is DH

Computer Science

Linguistics

Corpus Linguistics investigates language with thehelp of corpora

ComputationalLinguistics

DH

investigates techniques for theprocessing of natural language

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 4

Page 10: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Computational Linguistics and Digital Humanities

Why do the humanities want to work with CL?many humanities projects are text-based

→ tools and workflows from CL can be adaptedto DH projects

larger corpora can be analysed

Why does CL want to work with the humanities?non-standard language, text structure, applicationsand research questions from DH are new for CLhumanities scholars provide expert knowledgeresearch question can limit the scope of what a toolhas to provide

both: collaboration is a chance and challenge

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 5

Page 11: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Computational Linguistics and Digital Humanities

Why do the humanities want to work with CL?many humanities projects are text-based

→ tools and workflows from CL can be adaptedto DH projects

larger corpora can be analysedWhy does CL want to work with the humanities?

non-standard language, text structure, applicationsand research questions from DH are new for CLhumanities scholars provide expert knowledgeresearch question can limit the scope of what a toolhas to provide

both: collaboration is a chance and challenge

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 5

Page 12: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Computational Linguistics and Digital Humanities

Why do the humanities want to work with CL?many humanities projects are text-based

→ tools and workflows from CL can be adaptedto DH projects

larger corpora can be analysedWhy does CL want to work with the humanities?

non-standard language, text structure, applicationsand research questions from DH are new for CLhumanities scholars provide expert knowledgeresearch question can limit the scope of what a toolhas to provide

both: collaboration is a chance and challenge

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 5

Page 13: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Annotation as Basis for… Everything

explicit assignment of categories to text spanstext spans are explicitly bounded (begin, end)

annotation by humans vs. annotation by computers

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 6

Page 14: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

The Day-to-Day Work of a Computational Linguist

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 7

Page 15: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Why do we annotate?

empirical validation of theoriesdiscovering phenomena not covered by a theorystrengthening definitions in a theory

often confused categories might be overlapping or atleast unclear

uncovering implicit assumptionsdata creation

manually annotated data can be analysedwhich categories are how frequent in what context?

automatic tools can be evaluatedHow well do machines do this task?

supervised tools can be trained

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 8

Page 16: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Annotationworkflow

Figure: Annotation workflow schema. Arrows indicate (rough) temporal sequence.

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 9

Page 17: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Inter-subjective Annotations in DH?

Linguistic annotations: Inter-subjective annotationsE.g., part of speechDifferent annotators create the same annotationsInter-Annotator Agreement (IAA)

Annotation guidelinesMediator between theory and annotationGuidebook for annotators (who may not be experts)

What is to be annotated?Which categories are used in which cases?How to deal with borderline cases?How have we dealt with difficult cases in the past?Tons of examples

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 10

Page 18: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

CL and DH inStuttgart

2

Page 19: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Vertiefung geisteswissen-schaftliches Fach

Spezialisierung Digital Humanities

Spezialisierung Informatik

1. Semester 2. Semester 3. Semester 4. Semester

M a s t e r s t u d i e n g a n g D i g i t a l H u m a n i t i e s , U n i v e rs i tä t S t u t t ga r t

Wahlbereich Geisteswissenschaften - Vertiefung Geisteswissenschaften* 12-18

Masterarbeit 30

DH in den Geisteswissenschaften I

(Ringvorlesung)6

DH in den Geisteswissenschaften II

*** 6

Theoretische und informatische

Grundlagen für die Digital Humanities

(VL+Ü)

9

Methoden der Digital Humanities (Sem) 6

Reflexion digitaler Methoden

(Sem)6

Projektarbeit (P) 9 Forschungskolloquium(Sem) 6

Methoden maschineller Sprachverarbeitung

(VL+Ü)9

Wahlbereich Informatik** 12-18

Programmierkurs (Ü)** 3

Semester 1 30 LP Semester 2 30

LP Semester 3 30 LP Semester 4 30

LP

* Importmodule aus den Geisteswissenschaften

** Importmodule Informatik

*** DH-Veranstaltung in den Geisteswissenschaften - Veranstaltung wird vom geisteswissenschaftlichen Fach angeboten 14.04.2016

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 12

Page 20: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Ver�efung

geisteswissen-

scha�liches Fach

Spezialisierung

Digital Humani�es

Spezialisierung

Informa�k

1. Semester 2. Semester 3. Semester 4. Semester

M a s t e r s t u d i e n g a n g D i g i t a l H u m a n i t i e s , U n i v e rs i t ä t S t u t t ga r t

Wahlbereich Geisteswissenschaften - Vertiefung Geisteswissenschaften* 12-18

Masterarbeit 30

DH in den Geisteswissenschaften I

(Ringvorlesung)6

DH in den Geisteswissenschaften II

*** 6

Theoretische und informatische

Grundlagen für die Digital Humanities

(VL+Ü)

9

Methoden der Digital Humanities (Sem)

6Reflexion digitaler

Methoden(Sem)

6

Projektarbeit (P) 9Forschungskolloquium

(Sem)6

Methoden maschineller Sprachverarbeitung

(VL+Ü)9

Wahlbereich Informatik** 12-18

Programmierkurs (Ü)** 3

Semester 130 LP

Semester 230 LP

Semester 330 LP

Semester 430 LP

* Importmodule aus den Geisteswissenschaften

** Importmodule Informatik

*** DH-Veranstaltung in den Geisteswissenschaften - Veranstaltung wird vom geisteswissenschaftlichen Fach angeboten 14.04.2016

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 12

Page 21: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

NLP Methods for DH

Two main ideascover some basic NLP taskscover conceptual workflow

PoS tagging Named EntityRecognition

Coreference Resolution

Annotation STTS, IAA, Kappa NoSta-D, BIO TueBa + NoSta-DModelling Decision Trees HMM, CRF ClusteringEvaluation Accuracy Precision/Recall Specific metrics (CEAF,

B3, BLANC)

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 13

Page 22: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

QuaDramA1

Test hypotheses about dramatic characters on large data setsusing NLP methods

Dimensions:Character classes by gender, action, social class, stockcharacter

How are fathers marked as being ‘tender’?Relations between characters

What are shared topics/emotions?Character (type) development

When are characters types appearing/disappearing?

1Project members: Nils Reiter, Marcus Willand, Benjamin Krauter, Janis Pagel

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 14

Page 23: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Co-presence of Characters

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 15

Page 24: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Center for Reflected Text Analytics (CRETA)

Social Sciences

Linguistics

Philosophy

Medieval

German

Studies

Natural Language

Processing

Fak. 10

Fak.9

CRETA

Digital

Humanities

Fak. 5Visualisation and

Interactive

Systems

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 16

Page 25: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Narrative Structure of Arthurian Romance (Example Parzival)1

1Project members: Nora Ketschik, André Blessing, Manuel Braun

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 17

Page 26: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Wertheradaptionen1

Figure: Anzahl publizierter Wertheriaden seit der Publikation des Originals in 1774.

1Projektbeteiligte: Sandra Murr, Sandra Richter

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 18

Page 27: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Love Triangle

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 19

Page 28: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Religion in International Newspaper Discourse1

1Project mebers: Maximilian Overbeck, Cathleen Kantner

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 20

Page 29: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Concepts and their Connections in Adorno1

“Interessant ist, daß sich mir bei der Arbeit aus dem Inhalt derGedanken gewisse Konsequenzen für die Form aufdrängen,die ich längst erwartete, aber die mich nun doch überraschen.Es handelt sich ganz einfach darum, daß aus meinemTheorem, daß es philosophisch nichts ‘Erstes’ gibt, nun auchfolgt, daß man nicht einen argumentativen Zusammenhang inder üblichen Stufenfolge aufbauen kann, sondern daß man dasGanze aus einer Reihe von Teilkomplexen montieren muß, diegleichsam gleichgewichtig sind und konzentrisch angeordnet,auf gleicher Stufe; deren Konstellation, nicht die Folge, mußdie Idee ergeben.”

1Project member: Axel Pichler

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 21

Page 30: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Concepts and their Connections in Adorno1

Figure: Term networks in Wittgenstein’s Tractatus (left) and Adorno’s “Ästhetische Theorie” (right)

1Project member: Axel Pichler

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 22

Page 31: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Common interest

Individually distinguishable objects in the real or a fictionalworld

“Individually distinguishable” = by naming“objects” does not imply physical objects

References to such entities in textsProper names (“Angela Merkel”)Descriptive noun phrases (“the chancellor”)Pronouns (“she”)

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 23

Page 32: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Entities

ParzivalCharacters und groups of characters (“three knights”)Locations

WertherCharacters: Werther, Lotte, Albert

Parliamentary debatesPolitical parties, international organisations (“EUcommission”, “United Nations general assembly”)

Aesthetic TheoryPhilosophers, works, abstract concepts

Dramatic TextsCharacters: Hamlet, Sara Sampson, the priest

Nils Reiter, Sarah Schulz, IMS, Universität Stuttgart: Digitale Methoden in den Geisteswissenschaften 24

Page 33: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Annotating Coreference

Page 34: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• If two linguistic units co-refer, they refer to the same non-textual entity• [A woman] bought a house. [The woman] lived happily ever after.

(prototypical)• [A woman] bought a house. [She] lived happily ever after. (anaphoric)• [Mary] bought a house. [The woman] enjoyed the garden. (coreferential)• [Mary likes to sing in the shower.] Her boyfriend hates [that]. (non-nominal

antecedent)

Definition and ExamplesCoreference

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 35: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• If two linguistic units co-refer, they refer to the same non-textual entity• [A woman] bought a house. [The woman] lived happily ever after.

(prototypical)• [A woman] bought a house. [She] lived happily ever after. (anaphoric)• [Mary] bought a house. [The woman] enjoyed the garden. (coreferential)• [Mary likes to sing in the shower.] Her boyfriend hates [that]. (non-nominal

antecedent)

• Complications• Embedded mentions: [ [Donald Trumps] Gouvernement]• Plural mentions: [John] was cooking for [Mary]. [They] enjoyed it.• Discontinuity/relative clauses: Ich habe [die Sendungen] zur Post gebracht,

[ [die] hier schon lange rumliegen].• ...

Definition and ExamplesCoreference

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 36: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Mention detection

• Find all linguistic units that refer (to something)

• Named entity detection + X • Noun phrases (the gardener)

• Almost all NPs can be used both non-referring and referring

• Pronouns (she)

• Trivial, since it’s a fixed list

• Coreference resolution

• Group mentions that co-refer

• Most simple approach: Classification of mention pairs

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Two tasksCoreference

Page 37: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Mention detection• Find all linguistic units that refer (to something)• Named entity detection + X

• Noun phrases (the gardener)• Almost all NPs can be used both non-referring and referring

• Pronouns (she)• Trivial, since it’s a fixed list

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Two tasksCoreference

Page 38: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Create a gold standard for a couple of relevant texts

• Allows for experimentation of simple hacks (dramas) as well as more

complex systems

• Be compatible to standard coreference as much as possible

• Reuse of annotation guidelines (Naumann, 2007)

• Potential integration of tools

• Potential combination of annotated data (→ domain adaptation)

• Apply the same/similar scheme to prose, drama, and non-literary texts

Goals

Annotating Coreference

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 39: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Singletons are not annotated

• Non-nominals are included (as antecedents)

• Deviations from standard coreference• Chains: Elements of a chain form an equivalence set• No annotation of link type (e.g., anaphoric vs. coreferential

vs. cataphoric)• Can be detected automatically

GuidelinesAnnotating Coreference

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 40: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Prose • Goethe: Die Leiden des Jungen Werther (1787)• Perutz: Zwischen neun und neun (1918), Nur ein Druck auf den Knopf (1930), Der

Mond lacht (1930)• Rowlandson: Narrative of the Captivity and Restoration of Mrs. Mary Rowlandson

(1682, English)• Fairy tales

• Drama• Lessing: Miß Sara Sampson (Bürgerliches Trauerspiel)• Schiller: Die Räuber (Sturm und Drang)• Hofmannsthal: Der Rosenkavalier (?)

• Drama-like• Bundestag: Parliamentary debates (1996-2015)

CorpusAnnotating Coreference

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 41: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Annotation IssuesAnnotating Coreference

Page 42: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Der Wirt. […] Was liegt mir daran, ob ich es weiß, oder nicht, was Sie für eine Ursache hierher führt, und warum Sie bei mir im Verborgnen sein wollen? Ein Wirt nimmt sein Geld, und läßt seine Gäste machen, was ihnen gut dünkt. Waitwell hat mir zwar gesagt, […]

(Lessing, Miß Sara Sampson)

Use of generic noun phrases (1/2)Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 43: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Der Wirt. […] Was liegt mir daran, ob ich es weiß, oder nicht, was Sie für eine Ursache hierher führt, und warum Sie bei mir im Verborgnen sein wollen? Ein Wirt nimmt sein Geld, und läßt seine Gäste machen, was ihnen gut dünkt. Waitwell hat mir zwar gesagt, […]

(Lessing, Miß Sara Sampson)

[Sie]1 ließ [[ihren]2 Regenschirm]3 fallen. [Jeder junge Mann]4 wird in einem solchen Fall blitzschnell nach [dem Schirm]5 greifen und [ihn]6 [der Dame]7überreichen. Und [die Dame]8 bedankt sich vielmals. Aber diesmal geschah etwas Unerhörtes. [Stanislaus Demba]9 ließ [den Schirm]10 liegen.

(Perutz, Zwischen Neun und Neun)

Use of generic noun phrases (1/2)Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 44: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Generic expressions are relatively frequent• “[Ihr Mannspersonen] müßt doch selbst nicht wissen, was ihr wollt.”• “Mellefont besitzt alles, was uns [eine Mannsperson] gefährlich machen

kann.”• Should generic expressions be co-referent (e.g., multiple references to

the class man)?• Standard guidelines: No coreference between generic mentions

• Generic noun phrases not specific to literary texts

• Switching between generic and non-generic use is more frequent than in newspaper texts

Use of generic noun phrases (2/2)Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 45: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Entity development and character perception

Annotation Issues

• Characters change over the

course of the text (frog

prince)

• Characters in disguise (red

riding hood: wolf poses as

grandmother)

Min

imalis

t fa

iry tale

poste

rs b

y C

hristian J

ackson

• Annotating reader knowledge possible in both cases

• What do we loose?

• What knowledge do we define?

• Important for reaching inter-subjective annotationsReiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 46: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Hermann. [...] Da Karl auf der Welt nichts mehr zu hoffen hatte, zog ihn der Hall von Friederichs siegreicher Trommel nach Böhmen. Erlaubt mir, sagte er [zum großen Schwerin], daß ich den Tod sterbe auf dem Bette der Helden [...].

(Schiller, Die Räuber)

• Friederich == großer Schwerin?

World knowledgeAnnotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 47: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Hermann. [...] Da Karl auf der Welt nichts mehr zu hoffen hatte, zog ihn der

Hall von Friederichs siegreicher Trommel nach Böhmen. Erlaubt mir, sagte

er [zum großen Schwerin], daß ich den Tod sterbe auf dem Bette der

Helden [...].

(Schiller, Die Räuber)

• Friederich == großer Schwerin?

• No. Großer Schwerin = Kurt Christoph Graf von Schwerin

• https://de.wikipedia.org/wiki/Kurt_Christoph_von_Schwerin

World knowledge

Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 48: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Mellefont. Mit Unrecht tadelt sie die Verzögerung [einer Zeremonie] [...].

Sara. Neue Freunde sollen die Zeugen [unserer Verbindung] sein? [...]

Mellefont. Aber überlegen Sie denn nicht, Miss, dass [unserer Verbindung]

hier diejenige Feier fehlen würde, die wir ihr zu geben schuldig sind?

(Lessing, Miß Sara Sampson)

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Lexical variation / coreference vs. bridging

Annotation Issues

Page 49: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Mellefont. Mit Unrecht tadelt sie die Verzögerung [einer Zeremonie] [...].

Sara. Neue Freunde sollen die Zeugen [unserer Verbindung] sein? [...]

Mellefont. Aber überlegen Sie denn nicht, Miss, dass [unserer Verbindung] hier diejenige Feier fehlen würde, die wir ihr zu geben schuldig sind?

(Lessing, Miß Sara Sampson)

• “Verbindung” refers to both the wedding and the marriage

• Related to bridging: Reference to entities that are related to previously mentioned entities• Mary bought [a car]. [The engine] is strong.

• (bridging relation between car and engine)

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Lexical variation / coreference vs. bridgingAnnotation Issues

Page 50: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Mellefont. […] ich hatte noch keine verwahrlosete Tugend auf meiner

Seele. Ich hatte noch keine Unschuld in ein unabsehliches Unglück

gestürzt. Ich hatte noch keine Sara aus dem Hause eines geliebten Vaters

entwendet, und sie gezwungen, einem Nichtswürdigen zu folgen, der auf

keine Weise mehr sein eigen war.

• At utterance time, Mellefont has abducted Sara.

• “kein X”: Typically considered to be generic and therefore non-referring

• But clearly related to (and understood as) a non-generic entity (Sara)

Colorful language

Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 51: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Mellefont. […] ich hatte noch keine verwahrlosete Tugend auf meiner

Seele. Ich hatte noch keine Unschuld in ein unabsehliches Unglück

gestürzt. Ich hatte noch keine Sara aus dem Hause eines geliebten Vaters

entwendet, und sie gezwungen, einem Nichtswürdigen zu folgen, der auf

keine Weise mehr sein eigen war.

• At utterance time, Mellefont has abducted Sara.

• “kein X”: Typically considered to be generic and therefore non-referring

• But clearly related to (and understood as) a non-generic entity (Sara)

Sara. […] Wenn zum Exempel, [ein Mellefont] [eine Marwood] liebt, und sie

endlich verläßt

• Proper nouns with indefinite article

Colorful language

Annotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 52: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Mellefont. Eine schlechte Vorbereitung, eine [trost]suchende Betrübte zu empfangen. Warum sucht [ihn] auch bei mir? – Doch wo soll sie [ihn] sonst suchen?

• Only possible antecedent for „ihn“: trost

• Sub-token annotation!

Waitwell. Sara liebt ihren Vater noch. Wenn Sie nur [davon] überzeugt sein wollen

• „davon“ refers to the entire sentence

Non-nominal antecedentsAnnotation Issues

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Page 53: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

“Vierzehn Personen, darunter zwei Professoren der Universität, […] Der Universitätsprofessor versuchte es […][…] und einer von den Professoren versuchte [… ]Einer von den Universitätsprofessoren trat hin […]”

(Perutz, Nur ein Druck auf den Knopf)

• Two references must be co-referential but we don’t know which ones

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

True ambiguitiesAnnotation Issues

Page 54: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

“Vierzehn Personen, darunter zwei Professoren der Universität, […] Der Universitätsprofessor versuchte es […][…] und einer von den Professoren versuchte [… ]Einer von den Universitätsprofessoren trat hin […]”

(Perutz, Nur ein Druck auf den Knopf)

• Two references must be co-referential but we don’t know which ones

• Other examples:

• Rhetorical “we” in parliamentary debates (we the opposition, we the CDU, we the Europeans, we the audience, we politicians, …)

• Groupings in Schiller’s Räuber: Who exactly belongs to the group at which point is impossible to determine

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

True ambiguitiesAnnotation Issues

Page 55: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Annotation ToolAnnotating Coreference

Page 56: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Basics: Group mentions that refer to the same entity

• Handling of groups, discontinuous clauses

• Long texts

• A lot of entities

• Flexible Im- and Export

• E.g., TEI/XML

• Handling of text structure (acts, scenes, who speaks, stage directions)

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

RequirementsAnnotation Tool

Page 57: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Annotators edit TEI/XML files directly

• Readily available

• Well-formedness/validity can be ensured automatically

• Slow, cumbersome, somewhat error-prone

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Failed but tried

Annotation Tool

<sp who="#sir_william"><speaker>SIR WILLIAM.</speaker><l> Hier <rs ref="#sara">meine Tochter</rs>? Hier in <rs xml:id="wirtshaus">

diesem elenden Wirtshause</rs>?</l></sp>

Page 58: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Annotators edit TEI/XML files directly

• Readily available• Well-formedness/validity can be ensured automatically• Slow, cumbersome, somewhat error-prone

• WebAnno https://webanno.github.io• Conversion from/to TEI/XML• Web-based (distribution of documents, update of software)• Impossible to handle many long chains• UI unresponsive for long documents

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Failed but triedAnnotation Tool

<sp who="#sir_william"><speaker>SIR WILLIAM.</speaker><l> Hier <rs ref="#sara">meine Tochter</rs>? Hier in <rs xml:id="wirtshaus">

diesem elenden Wirtshause</rs>?</l></sp>

Page 59: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Annotators edit TEI/XML files directly

• Readily available• Well-formedness/validity can be ensured automatically• Slow, cumbersome, somewhat error-prone

• WebAnno https://webanno.github.io• Conversion from/to TEI/XML• Web-based (distribution of documents, update of software)• Impossible to handle many long chains• UI unresponsive for long documents

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Failed but triedAnnotation Tool

<sp who="#sir_william"><speaker>SIR WILLIAM.</speaker><l> Hier <rs ref="#sara">meine Tochter</rs>? Hier in <rs xml:id="wirtshaus">

diesem elenden Wirtshause</rs>?</l></sp>

Page 60: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

A new annotation tool!• Developed for long texts with many entities

• Entities represented as sets of mentions, no binary relations between mentions

• Useable with a keyboard for fast annotation

• Automatic generation of candidate entitiesbased on context and previous annotations (semi-automatic annotation)

• Annotations stored in stand-off format using UIMA as a platform• Compatible to WebAnno

• Import/Export in various formats (CoNLL, TEI/XML*)

• Open Source: Apache License 2.0

• https://github.com/nilsreiter/CorefAnnotator

University of Stuttgart 31

CorefAnnotatorAnnotation Tool

* Given som

e preconditions

Page 61: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Developed for long texts with many entities

• Entities represented as sets of mentions, no binary relations between mentions

• Useable with a keyboard for fast annotation

• Automatic generation of candidate entities based on context and previous annotations (semi-automatic annotation)

• Annotations stored in stand-off format using UIMA as a platform

• Open Source: Apache License 2.0

• https://github.com/nilsreiter/CorefAnnotator

University of Stuttgart 32

CorefAnnotatorAnnotation Tool

Page 62: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Developed for long texts with many entities

• Entities represented as sets of mentions, no binary relations between mentions

• Useable with a keyboard for fast annotation

• Automatic generation of candidate entities based on context and previous annotations (semi-automatic annotation)

• Annotations stored in stand-off format using UIMA as a platform

• Open Source: Apache License 2.0

• https://github.com/nilsreiter/CorefAnnotator

University of Stuttgart 33

CorefAnnotatorAnnotation Tool

Page 63: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Developed for long texts with many entities

• Entities represented as sets of mentions, no binary relations between mentions

• Useable with a keyboard for fast annotation

• Automatic generation of candidate entities based on context and previous annotations (semi-automatic annotation)

• Annotations stored in stand-off format using UIMA as a platform

• Open Source: Apache License 2.0

• https://github.com/nilsreiter/CorefAnnotator

University of Stuttgart 34

CorefAnnotatorAnnotation Tool

Page 64: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Developed for long texts with many entities

• Entities represented as sets of mentions, no binary relations between mentions

• Useable with a keyboard for fast annotation

• Automatic generation of candidate entities based on context and previous annotations (semi-automatic annotation)

• Annotations stored in stand-off format using UIMA as a platform

• Open Source: Apache License 2.0

• https://github.com/nilsreiter/CorefAnnotator

University of Stuttgart 35

CorefAnnotatorAnnotation Tool

Page 65: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Summary

Page 66: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

• Lessons learned• Clarify concepts through annotation and discussion• Annotation useful in DH, because close reading• Common sub tasks can be addressed across disciplines/research

questions• Expert knowledge helpful for method development• Coreference fascinating task when applied to literary texts because it

highlights the blind spots of linguistic theory / CL assumptions

• Collaboration• Two-way street

• CL: New challenges for established workflows• DH: Quantification and inter-subjectivity

Reiter, Schulz, Uni Stuttgart: Reflected Text Analysis: Computational Linguistics meets Digital Humanities

Summary

Page 67: Reflected Text Analysis: Computational Linguistics Meets ...philtypo3.uni-koeln.de/sites/spinfo/dch/dhc-kolloq-schulz-reiter-2018-06-07.pdf · Ver efung geisteswissen-scha liches

Vielen Dank!

E-MailTelefon +49 (0) 711 685-Fax +49 (0) 711 685-

Universität Stuttgart

Pfaffenwaldring 5b70569 Stuttgart

Sarah Schulz, Nils Reiter

8135481366

Institut für Maschinelle Sprachverarbeitung

[email protected]