2010 digital humanities london - dutch republic of letters

scholarlycommunication@ 1650

scholarlycommunication@ 2050

Letters, Ideas and Information Technology

Erik-Jan Bos, Univ. Utrecht

Charles van den Heuvel, VKS

Dirk Roorda (that’s me), DANS

Using digital corpora of letters to disclose the circulation of knowledge in the 17th century



http://ckcc.huygens.knaw.nl/

[Figure: diagram with the labels Nota, Beeckman, Cats, Huygens, Langeren and STEVIN; relation between disciplines: direct (water), indirect (literature)]

Corpora of 17th century scholars

• Constantijn Huygens
• Christiaan Huygens
• Grotius
• Descartes
• Swammerdam
• Leeuwenhoek
• Barlaeus
• Spinoza
• and more?

Corpus               Number of letters  In possession?  Format         Metadata                 Normalized?
Grotius              7946               Yes             TEI            In Interp element        Yes, DBNL codes
Van Leeuwenhoek      337                Yes             TEI            In Interp element        Yes, DBNL codes
Descartes            750                Yes             XML (no TEI)   other markup             No, plain text
Barlaeus             1200               300 ready       Word           unknown                  unknown
Swammerdam           80                 Yes             Word           unknown                  unknown
Constantijn Huygens  7295               Yes             XML            Probably Interp element  DBNL codes
Christiaan Huygens   2900?              Mid-2010        probably TEI   Probably Interp element  DBNL codes
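Several corpora in the overview keep their letter metadata in TEI Interp elements. A minimal sketch of reading such metadata, assuming a hypothetical letter layout: the sample XML, the `type` values and the placeholder names are invented for illustration and are not taken from the actual corpora (real TEI P5 files also carry the `http://www.tei-c.org/ns/1.0` namespace, which is left out here for brevity).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample letter; structure and values are illustrative only.
SAMPLE_LETTER = """
<TEI>
  <text>
    <body>
      <div type="letter">
        <interpGrp type="metadata">
          <interp type="sender">Sender A</interp>
          <interp type="recipient">Recipient B</interp>
          <interp type="date">1650-01-01</interp>
        </interpGrp>
        <p>Letter text goes here.</p>
      </div>
    </body>
  </text>
</TEI>
"""

def letter_metadata(xml_text):
    """Collect the text of every <interp> element, keyed by its type attribute."""
    root = ET.fromstring(xml_text.strip())
    return {
        interp.get("type"): (interp.text or "").strip()
        for interp in root.iter("interp")
    }

if __name__ == "__main__":
    print(letter_metadata(SAMPLE_LETTER))
```

Corpora that are still in Word or plain text would first need a conversion step before anything like this applies.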

CEN - Metadata

Catalogus Epistularum Neerlandicarum
265,000 descriptions of approximately 1,000,000 letters
from 1600 to the present
of which 100,000 letters in the 17th century

Research Questions

• History of science:
    • How did knowledge circulate in the 17th-century Dutch Republic?

• Patterns in knowledge growth:
    • How can we visualise sets of letters that exhibit features of knowledge circulation?

• Re-use:
    • How can we expose the sources, annotations, and resulting patterns to further research?

Challenge

East: Traditional scholarship
• interpretation
• close reading
• solving puzzles

West: Computational methods
• dealing with patterns
• gleaned from large quantities of texts
• by automatic tools

East is east and West is west and ...

Issues to deal with

• making the sources uniformly available
    • well coded in TEI, access rights

• overcoming the language barrier
    • (17th-century varieties of French, Latin, Dutch)

• named entity recognition & concepts
    • people, places, dates, concepts, instruments
    • mixture of interpretation and algorithms

• creating useful visualisations
    • aiding exploration by historians of science

ICT in Humanities Research

• collaboratory
    • e-Laborate as starting point

• algorithmic pipelines
    • from source material to visualisation

• infrastructure
    • archiving results
    • re-using data
    • developing new algorithms
    • disseminating the methodology

collaboratory

pipelines

pipelines (current)

• language detection, using
    Language Identification from Text Using N-gram Based Cumulative Frequency Addition
    Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, 2004

• results
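The idea behind n-gram based language detection can be sketched as follows. This is a simplified illustration in the spirit of cumulative frequency addition, not the project's code and not a faithful reimplementation of the cited paper: each language gets a table of relative trigram frequencies, and a text is assigned to the language whose table yields the largest summed frequency. The tiny training samples and all function names are made up for the example.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return the overlapping character n-grams of a whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples, n=3):
    """Build one relative-frequency table of n-grams per language."""
    profiles = {}
    for lang, text in samples.items():
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        profiles[lang] = {gram: count / total for gram, count in counts.items()}
    return profiles

def identify(text, profiles, n=3):
    """Sum the stored frequencies of the text's n-grams per language; pick the maximum."""
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in ngrams(text, n))
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores

# Invented toy training data; real profiles need far larger samples per language.
SAMPLES = {
    "fr": "je vous prie de me faire scavoir ce que vous pensez de la lettre que monsieur a escrite",
    "nl": "ik heb de brief van mijn heer ontvangen en ik zal het boek aan u zenden",
    "la": "litteras tuas accepi quibus respondeo de rebus quas mihi scripsisti",
}

if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(identify("ik zal de brief zenden", profiles)[0])
```

Letters that switch language mid-text would need to be scored per paragraph or sentence rather than as a whole.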

pipelines (current)

• spelling normalisation
    • VARD (http://www.comp.lancs.ac.uk/~barona/vard2/)
    • with help from Dicollecte (http://www.dicollecte.org/home.php?prj=fr)

• results
    • French: VARD works (after improvements), although designed for historical English
    • Dutch: still on the lookout for a combination of resources, tools, and dexterity
    • Latin: later
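To make the normalisation step concrete, here is a minimal sketch of dictionary-based spelling normalisation. It does not use or imitate VARD's interface; it only illustrates the general idea of mapping historical variants to modern forms, first through a list of known variants and then through fuzzy matching against a modern word list. The variant list, the lexicon and the cutoff value are all assumptions chosen for the example.

```python
import difflib

# Hypothetical hand-made resources; real ones would come from curated lexica.
KNOWN_VARIANTS = {"estre": "être"}
MODERN_LEXICON = ["toujours", "savoir", "avoir", "lettre", "faire"]

def normalise_token(token, cutoff=0.85):
    """Map one historical spelling to a modern form, or return it unchanged."""
    lower = token.lower()
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    if lower in MODERN_LEXICON:
        return lower
    # Fall back to the closest lexicon entry above the similarity cutoff.
    matches = difflib.get_close_matches(lower, MODERN_LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def normalise(text):
    """Normalise a whitespace-separated text token by token."""
    return " ".join(normalise_token(token) for token in text.split())

if __name__ == "__main__":
    print(normalise("tousjours scavoir estre"))
```

A high cutoff keeps false corrections down at the cost of leaving more variants untouched, which is the trade-off any such tool has to tune per language.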

pipelines (current)

• named entity recognition
    • known tools get 70%
    • search for optimal tools in the next stage
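One simple baseline for named entity recognition is a gazetteer lookup, where a curated list of names is matched against the text. The sketch below is only an illustration of that baseline, not one of the tools evaluated in the project; the gazetteer entries, labels and function name are assumptions made for the example.

```python
import re

# Hypothetical gazetteer; in practice such lists are curated by scholars.
GAZETTEER = {
    "person": ["Constantijn Huygens", "Christiaan Huygens", "Descartes", "Grotius"],
    "place": ["Leiden", "Utrecht", "Paris"],
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, label, name) for every gazetteer name found in text."""
    hits = []
    for label, names in gazetteer.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                hits.append((match.start(), match.end(), label, name))
    return sorted(hits)

if __name__ == "__main__":
    sentence = "Descartes wrote to Constantijn Huygens from Leiden."
    for start, end, label, name in tag_entities(sentence):
        print(start, end, label, name)
```

Exact matching of this kind misses spelling variants and Latinised name forms, which is one reason it benefits from spelling normalisation and from the mixture of interpretation and algorithms mentioned earlier.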

pipelines (insights)

• expect the most from statistical methods

• language technology may boost results

• it remains to be seen by how much

Topic-Author-Time

Source: Scott Weingart UIA

infrastructure

the project’s legacy

• more than publications
    • curated sources, annotations, visualisations

• more than algorithms
    • a framework for analysis of historical texts

• more than a piece of historical research
    • data and (intermediate) results worthwhile to linguists, computer scientists, sociologists

• more than a passive dataset
    • extensible, dynamic, interactive

preserving the results

• part of the CLARIN infrastructure
    • http://www.clarin.eu/
    • http://www.clarin.nl/

• materials in a Trusted Digital Repository (DANS)
    • http://easy.dans.knaw.nl/dms

working with CLARIN

• CLARIN-EU
    • Outreach to humanities: use cases
    • CKCC one of 10 selected projects
    • received expert input for choice of language tools

• CLARIN-NL
    • CKCC one of 10 initial projects in the Dutch national construction effort
    • support for applying language technology

Adapting to CLARIN

• Conforming to standards

• CLARIN standards are in evolution
    • (and will remain evolvable)

• Common MetaData Infrastructure
    • a registry of metadata components
    • defined by the community
    • with explicit semantics (http://www.isocat.org/)

• Data in TEI (as export/import format)

Trusted Digital Repository

• materials
    • reliable (provenance metadata)
    • findable (CMDI metadata)
    • referable (persistent identifiers)
    • accessible (viewable in a web browser)
    • usable (downloadable)

• sooner or later:
    • high-performance computing
    • Memento: a time-sensitive web interface to the dynamic contents of the collaboratory (http://arxiv.org/abs/0911.1112)

http://www.clarin.eu/node/3073

http://ckcc.huygens.knaw.nl/
