digital humanities: modeling semi-structured data from ...outline intro: a few thoughts on...

86
Digital humanities: modeling semi-structured data from traditional scholarship Tom Lippincott IntroHLT Fall 2019 Human Language Technology Center of Excellence Center for Language and Speech Processing 1

Upload: others

Post on 28-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Digital humanities: modelingsemi-structured data from traditionalscholarship

Tom LippincottIntroHLT Fall 2019Human Language Technology Center of ExcellenceCenter for Language and Speech Processing

1

Page 2: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Outline

Intro: A few thoughts on “Digital humanities”

Motivating study: Post-Atlantic Slave Trade

Model: Graph-Entity Autoencoders

Bonus study: Authorship attribution of ancient documents

Ongoing work

2

Page 3: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Intro: A few thoughts on “Digitalhumanities”

Page 4: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

What is “digital humanities”?

Some responses:

• “an idea that will increasingly become invisible” -Stanford

• “a term of tactical convenience” -UMD

• “I don’t: I’m sick of trying to define it” -GMU

• “a convenient label, but fundamentally I dont believe in it”-NYU

• “an unfortunate neologism” -Library of Congress

3

Page 5: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

What is “digital humanities”?

Themes at DH2019

• Visualization

• Geographic information systems

• Social and ethical issues

• Education

• VR, maker spaces

• OCR

• Machine learning

4

Page 6: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Working definitions

Digital humanities

Traditional inquiries enabled by computational intelligence

Traditional researcher

Academic from field that doesn’t typically employ quantitativemethods (History, Literary Criticism, . . .

(Traditional) scholarly dataset

Data assembled by a traditional researcher in the field

Computational researcher

Design and bring machine learning models to bear on datasets

5

Page 7: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Working definitions

Digital humanitiesTraditional inquiries enabled by computational intelligence

Traditional researcher

Academic from field that doesn’t typically employ quantitativemethods (History, Literary Criticism, . . .

(Traditional) scholarly dataset

Data assembled by a traditional researcher in the field

Computational researcher

Design and bring machine learning models to bear on datasets

5

Page 8: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Working definitions

Digital humanitiesTraditional inquiries enabled by computational intelligence

Traditional researcherAcademic from field that doesn’t typically employ quantitativemethods (History, Literary Criticism, . . .

(Traditional) scholarly dataset

Data assembled by a traditional researcher in the field

Computational researcher

Design and bring machine learning models to bear on datasets

5

Page 9: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Working definitions

Digital humanitiesTraditional inquiries enabled by computational intelligence

Traditional researcherAcademic from field that doesn’t typically employ quantitativemethods (History, Literary Criticism, . . .

(Traditional) scholarly datasetData assembled by a traditional researcher in the field

Computational researcher

Design and bring machine learning models to bear on datasets

5

Page 10: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Working definitions

Digital humanitiesTraditional inquiries enabled by computational intelligence

Traditional researcherAcademic from field that doesn’t typically employ quantitativemethods (History, Literary Criticism, . . .

(Traditional) scholarly datasetData assembled by a traditional researcher in the field

Computational researcherDesign and bring machine learning models to bear on datasets

5

Page 11: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Why is collaboration rare?

Traditional researchers have insight into the data

• Data is painstakingly gathered and coveted

• Hypotheses are subtle but not numerically evaluated

• May publish one or two papers during PhD, but dissertationis primary focus

Machine learning researchers can pair data with appropri-ate models

• Data is aggressively shared to encourage rigorousevaluation

• Tasks are often shallow and prespecified

• Publish multiple papers per year

6

Page 12: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Why is collaboration rare?

Traditional researchers have insight into the data

• Data is painstakingly gathered and coveted

• Hypotheses are subtle but not numerically evaluated

• May publish one or two papers during PhD, but dissertationis primary focus

Machine learning researchers can pair data with appropri-ate models

• Data is aggressively shared to encourage rigorousevaluation

• Tasks are often shallow and prespecified

• Publish multiple papers per year

6

Page 13: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Why is collaboration rare?

Traditional researchers have insight into the data

• Data is painstakingly gathered and coveted

• Hypotheses are subtle but not numerically evaluated

• May publish one or two papers during PhD, but dissertationis primary focus

Machine learning researchers can pair data with appropri-ate models

• Data is aggressively shared to encourage rigorousevaluation

• Tasks are often shallow and prespecified

• Publish multiple papers per year

6

Page 14: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Topic models: the rare success story

Widely used

• Low barrier to entry: everyone has “documents”

• Little expertise required

• Output easy to visualize and interpret

Widely abused

• Deceptively easy to use: it will give you something

• You can always find “patterns”: confirmation bias abounds

• Older than some undergrads: LDA from early 2000s

7

Page 15: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Topic models: the rare success story

Widely used

• Low barrier to entry: everyone has “documents”

• Little expertise required

• Output easy to visualize and interpret

Widely abused

• Deceptively easy to use: it will give you something

• You can always find “patterns”: confirmation bias abounds

• Older than some undergrads: LDA from early 2000s

7

Page 16: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Topic models: the rare success story

Widely used

• Low barrier to entry: everyone has “documents”

• Little expertise required

• Output easy to visualize and interpret

Widely abused

• Deceptively easy to use: it will give you something

• You can always find “patterns”: confirmation bias abounds

• Older than some undergrads: LDA from early 2000s

7

Page 17: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

A guiding challenge:

Can we leverage sophisticated modeling techniques withoutlosing the advantages that popularize topic models andrecreating some of the same bad community practices?

8

Page 18: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Aside: Traditional Researchers are Knowledge Workers

Financial analysts, investigative reporters . . .

• Concerned with specific domains

• Need to gather, build, and understand datasets

• Wide range of technical abilities

• The DH story is relevant to industry, government, etc

9

Page 19: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Motivating study: Post-AtlanticSlave Trade

Page 20: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Shipping manifests

slave slave slave owner journey vesselname sex age name date typeWillis m 20 Amidu 1832/9/24 SchoonerMaria f 19 Amidu 1832/09/24 Schooner

10

Page 21: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Shipping manifests

slave slave slave owner journey vesselname sex age name date typeWillis m 20 Amidu 1832/9/24 SchoonerMaria f 19 Amidu 1832/09/24 Schooner

10

Page 22: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Shipping manifests

slave slave slave owner journey vesselname sex age name date type

Willis m 20 Amidu 1832/9/24 SchoonerMaria f 19 Amidu 1832/09/24 Schooner

10

Page 23: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Shipping manifests

slave slave slave owner journey vesselname sex age name date typeWillis m 20 Amidu 1832/9/24 Schooner

Maria f 19 Amidu 1832/09/24 Schooner

10

Page 24: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Shipping manifests

slave slave slave owner journey vesselname sex age name date typeWillis m 20 Amidu 1832/9/24 SchoonerMaria f 19 Amidu 1832/09/24 Schooner

10

Page 25: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fugitive notices

slave slave escape escape owner notice noticename sex date location name reward dateDavy m 1795/10/15 Port Tobacco Bourman 3 Pistoles 1796/02/21

11

Page 26: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fugitive notices

slave slave escape escape owner notice noticename sex date location name reward dateDavy m 1795/10/15 Port Tobacco Bourman 3 Pistoles 1796/02/21

11

Page 27: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fugitive notices

slave slave escape escape owner notice noticename sex date location name reward date

Davy m 1795/10/15 Port Tobacco Bourman 3 Pistoles 1796/02/21

11

Page 28: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fugitive notices

slave slave escape escape owner notice noticename sex date location name reward dateDavy m 1795/10/15 Port Tobacco Bourman 3 Pistoles 1796/02/21

11

Page 29: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Some numbers

• 45k manifest entries spanning five cities

• 11k fugitive notices from 70 gazettes

• 28k unique slave names

• 7k unique owner names

• Not big data, but thousands of studies like this at aresearch university!

12

Page 30: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Difficulties with data in the wild

• Unnormalized• People/places/things recorded many times• “What’s the age/height/sex distribution of escapees?”

• Noisy• Vessel type: Bark, Barke, BArque, Barque, Barques• Slave name: “Nelly’?, Nelly’s child”, “not visible”• Owner sex: 3k missing

• Underspecified entities• Majority of slaves have no last name• Can’t tell if two “Johns” are the same person

13

Page 31: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Difficulties with data in the wild

• Unnormalized• People/places/things recorded many times• “What’s the age/height/sex distribution of escapees?”

• Noisy• Vessel type: Bark, Barke, BArque, Barque, Barques• Slave name: “Nelly’?, Nelly’s child”, “not visible”• Owner sex: 3k missing

• Underspecified entities• Majority of slaves have no last name• Can’t tell if two “Johns” are the same person

13

Page 32: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Difficulties with data in the wild

• Unnormalized• People/places/things recorded many times• “What’s the age/height/sex distribution of escapees?”

• Noisy• Vessel type: Bark, Barke, BArque, Barque, Barques• Slave name: “Nelly’?, Nelly’s child”, “not visible”• Owner sex: 3k missing

• Underspecified entities• Majority of slaves have no last name• Can’t tell if two “Johns” are the same person

13

Page 33: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Difficulties with data in the wild

• Unnormalized• People/places/things recorded many times• “What’s the age/height/sex distribution of escapees?”

• Noisy• Vessel type: Bark, Barke, BArque, Barque, Barques• Slave name: “Nelly’?, Nelly’s child”, “not visible”• Owner sex: 3k missing

• Underspecified entities• Majority of slaves have no last name• Can’t tell if two “Johns” are the same person

13

Page 34: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

What might a historian want to do with this data?

• Follow one slave throughout their life

• Group owners according to the nature of their workforce

• Determine what drove valuation in transactions andrewards

• Reconstruct slave families when there are no last names

• Map out trade “ecosystems” of sellers, shippers, owners,etc

14

Page 35: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fundamental observation

There is an implicit database schema here

• Field: a recorded value with a clear interpretation (age,name, manufacturer . . . )

• Entity-type: a coherent bundle of fields (person, location,object . . . )

• Entity-types and fields have been determined by traditionalscholars and common sense

• Relations between entities are also (conservatively)implied by the tabular format

This sets things up so we (ML researchers) can tackle thegeneral problem

15

Page 36: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Fundamental observation

There is an implicit database schema here

• Field: a recorded value with a clear interpretation (age,name, manufacturer . . . )

• Entity-type: a coherent bundle of fields (person, location,object . . . )

• Entity-types and fields have been determined by traditionalscholars and common sense

• Relations between entities are also (conservatively)implied by the tabular format

This sets things up so we (ML researchers) can tackle thegeneral problem

15

Page 37: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Traditional scholarly data

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 38: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Numbers

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 39: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Categories

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 40: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Strings

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 41: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

More complex fields

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 42: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Entities

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 43: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Entities

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 44: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Slave-to-owner

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 45: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Vessel-to-voyage, slave-to-voyage

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 46: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Entities, field types, and relations

Fewer assumptions

slave name Jimslave age 20owner name Janeowner sex Fvessel name Uncasvessel type Brig

voyage date 6/2/1823

voyage dest 29.9,90.0. . . . . .

Row 1

16

Page 47: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Full schema of possible entity relationships

17

Page 48: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Example data point: one graph component

18

Page 49: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Subsumes studies from a wide range of domains

19

Page 50: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Model: Graph-Entity Autoencoders

Page 51: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

General scholarly questions

• What is the overall picture of a particular entity?

• In what ways can we group entities?

• How are fields correlated?

• What missing fields and relationships can be recovered?

Three basic operations we’d like:

• Measure similarity of two entities

• Generate plausible field-values

• Score a proposed relationship between two entities

20

Page 52: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

General scholarly questions

• What is the overall picture of a particular entity?

• In what ways can we group entities?

• How are fields correlated?

• What missing fields and relationships can be recovered?

Three basic operations we’d like:

• Measure similarity of two entities

• Generate plausible field-values

• Score a proposed relationship between two entities

20

Page 53: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Building-blocks of the model

Encoders, decoders, and autoencoders

Graph convolutional networks

21

Page 54: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

; ; ; ;

22

Page 55: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Basic feed-forward neural model

(an “encoder”)

Hiddenlayers

Someinputdata

Task

Classification,regression,generation,

some combo . . .

Bottleneck

22

Page 56: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Basic feed-forward neural model

(an “encoder”)

Hiddenlayers

Someinputdata

Task

Classification,regression,generation,

some combo . . .

Bottleneck

22

Page 57: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Basic feed-forward neural model

(an “encoder”)

Hiddenlayers

Someinputdata

Task

Classification,regression,generation,

some combo . . .

Bottleneck

22

Page 58: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Basic feed-forward neural model (an “encoder”)

Hiddenlayers

Someinputdata

Task

Classification,regression,generation,

some combo . . .

Bottleneck

22

Page 59: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

A “decoder” goes in the opposite direction

Hiddenlayers

Someoutputdata

23

Page 60: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Encoders and decoders are often paired

Theseare thesame

CliffNotesStudies Recites

24

Page 61: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

If the goal is to reconstruct the input, it’s an “autoencoder”

Theseare thesame

CliffNotesStudies Recites

24

Page 62: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Convolutional networks (CNNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 63: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Convolutional networks (CNNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 64: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Convolutional networks (CNNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 65: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Graph-convolutional networks (GCNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 66: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Graph-convolutional networks (GCNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 67: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Graph-convolutional networks (GCNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 68: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Graph-convolutional networks (GCNs)

Grid(image,text . . . )

Eachpositionincor-

poratesits “re-ceptivefield”

Repeatprocess,expand

field

Graphnodes(e.g.

entities)

Adjacentnodes

(relatedentities)

Eachnode

incorpo-rates itsneigh-bors

Infospreadsaccord-ing tograph

25

Page 69: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Full model summary

• Read data, determine:• Fields and field-types• Entity-types• Relationships

• Each field allocated appropriate encoder-decoder pair

• Each entity-type allocated autoencoder

• Autoencoders use GCN-like mechanism to incorporateadjacent bottlenecks

26

Page 70: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Model sketch

27

Page 71: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Training is a complex process

• Random field dropout

• Graph component subselection

• Ways to combine loss functions

• . . .

28

Page 72: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

How can we use a trained model?

• Compute distance between two entities

• Find flat or hierarchical clusters of entities

• Generate likely value of missing field

• Detect an improbable value of a present field

• Observe response of one field to another

29

Page 73: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Example insights looking at most-similar entities

Mistranscriptions

Baltiomre, Austin Woolfolk ⇐⇒ Baltimore, Austin WoolfolkNew Orleans, William Wiliams ⇐⇒ New Orleans, William Williams

Semantically-equivalent variants

Baltimore, George Y. Kelso ⇔ Baltimore, Kelso & FergusonNew Orleans, Leon Chabert ⇔ Louisiana, Leon Chabert

Same slave transported multiple times

Louisa, F, 16yo ⇔ Louisa, F, 17yoWaters, F, 14yo ⇔ Waters, F, 15yoKesiah, F, 20yo ⇔ Kesiah, F, 22yoTaylor, F, 15yo ⇔ Taylor, F, 16yo

30

Page 74: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Bonus study: Authorshipattribution of ancient documents

Page 75: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Transmission of a text: the “Documentary Hypothesis”

31

Page 76: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Hypothesis as pointers into document structure

Jehovist

Elohist

Redactor

Bible

Book 1 Book 2

Chapter 1 Chapter 2

Verse 1 Verse 2

Word 1 Word 2

Morph 1 Morph 2

32

Page 77: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Thomas Mendenhall: The Characteristic Curves of Compo-sition

33

Page 78: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Mosteller and Wallace: Inference in an Authorship Problem

The Federalist papers

• 85 articles written by Hamilton, Madison, and Jay

• 12 are unattributed

• Frequency analysis of function words determined Madisonas author

34

Page 79: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Back to the Documentary Hypothesis

Problems

• The “authors” are also editors, redactors, synthesizers. . . they interact in context-dependent ways

• There is no predefined segmentation into “articles”

• We know more than function-words are important (e.g.name of God)

Solutions

• Limit vocabulary to words that are used frequently by allauthors

• Employ a GCN to exploit the document structure

• Take the DH for granted (for now)

35

Page 80: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

GEA predicts the author slightly better . . .

Model F-scoreLR 41.39MLP 47.45GEA 48.60

Gold GuessJ E P 1D 2D nD R O

J 100 8 7 0 0 0 3 0E 22 53 8 0 0 0 0 0P 13 5 77 0 1 0 4 01D 2 0 2 7 1 0 0 02D 2 2 1 0 5 0 0 0nD 0 0 0 1 0 0 0 0R 3 3 11 0 0 0 33 0O 2 0 1 0 0 0 1 0

36

Page 81: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Error analysis

Sentiment and in-context word senses

• “wife” shows up as polygamous in older but monogamousin newer sources

• Redactor’s positive view of Aaron+Moses, violent story ofrebellion

Narrative continuity

• Abraham and Isaac story thought to originally end withsacrifice, changed by the Redactor

• “it was the season for grapes. ¡travel and geographiclocations¿ They broke off some grapes.”

37

Page 82: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Recipe for literary criticism

• Collect or construct useful resources from traditionalscholarship

• Determine fit of potential “compositional actions” toobserved document tree

• Choose the actions that are high-scoring and parsimonious

• Put the hypothesis in front of domain experts forverification/annotation

38

Page 83: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Ongoing work

Page 84: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Visualizing results

39

Page 85: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Assembling other example studies

• JHU history department’s “Entertaining America” (tabular)

• Northeastern U’s Women Writers Collection (XML/TEI)

• Targeted sentiment analysis (JSON)

• Tennyson’s poetic development (unconstrained text)

40

Page 86: Digital humanities: modeling semi-structured data from ...Outline Intro: A few thoughts on “Digital humanities” Motivating study: Post-Atlantic Slave Trade Model: Graph-Entity

Thanks!

Quick plug: come to David Mimno’s talk!

• Nov. 15 at noon (Hackerman B17)

• CS Professor at Cornell

• Rare CS faculty working in DH (topic modeling)

41