cultural text mining workshop

Cultural text mining

Pim Huijnen, Utrecht University

DH Autumn School @ Uni Trier, October 1, 2015

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignments: looking for the right words





www.translantis.nl


National Library The Hague

~9.000.000 digitized newspaper pages 1618 - 1995


Digital Humanities Approaches to Reference Cultures; The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990

“…uses digital technologies to analyze the role of reference cultures in debates about social issues and collective identities, looking specifically at the emergence of the United States in public discourse in the Netherlands from the end of the nineteenth century to the end of the Cold War.”


The United States as a reference culture

Business, Society, Consumption, Media, Crime, Health


Transparency: user has to understand tools

Flexibility: constant to-and-fro between close and distant

Queries / title selection

Visibility of underlying data

Variable time frame

Export function (csv)

Linguistic and statistical settings

Histogram

Word cloud

BILAND

Query: ‘Heredity’ (1876) (22/1465 hits)

BILAND

Query: ‘Heredity’ (1935) (1465 hits)


Transparency: user has to understand tools

Flexibility: constant to-and-fro between close and distant

The greatest benefit of digital tools lies in exploring data, not in providing evidence

Exploratory text mining

“[R]igorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, working with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims.”

Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).


Exploratory text mining

“In other words, data does not always have to be used as evidence, but can be simply for discovering and framing research questions. […] [P]laying with data – in all its formats and forms – is more important than ever.”

Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).



Transparency: user has to understand tools

Flexibility: constant to-and-fro between close and distant

The greatest benefit of digital tools lies in exploring data, not in providing evidence

No one-size-fits-all solutions

Digital workflows (1)

Export of subset as a csv file

Stripping file of redundant metadata

Upload files into Voyant for further analysis
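The "stripping" step above could be done by hand in a spreadsheet; a minimal Python sketch is given here as one alternative. The column names (id, date, title, url, text) and the sample row are invented for illustration, since the actual layout of the export is not shown on the slides.

```python
import csv
import io

# Hypothetical column names: a real export may label its fields differently.
KEEP = ["date", "title", "text"]

def strip_metadata(src, dst, keep=KEEP):
    """Copy only the columns listed in `keep` from csv stream src to dst."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column not listed in `keep`.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Invented one-row sample standing in for an exported subset.
sample = io.StringIO(
    "id,date,title,url,text\n"
    "1,1924-11-20,Sample title,http://example.org/1,Sample article text\n"
)
out = io.StringIO()
strip_metadata(sample, out)
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with `newline=""`, as the csv module documentation recommends.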


Export of subset as a csv file


Splitting csv into csv-per-year

Upload files into Voyant for further analysis
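Splitting a csv into one file per year might be sketched as follows. This assumes a column named "date" holding ISO-style dates (YYYY-MM-DD); both the column name and the date format are assumptions and would need adjusting to the real export. The file prefix "subset_" is likewise hypothetical.

```python
import csv
import io
from collections import defaultdict

def split_by_year(src, date_field="date"):
    """Group csv rows by year, read from the first four characters of the date."""
    reader = csv.DictReader(src)
    groups = defaultdict(list)
    for row in reader:
        year = row[date_field][:4]  # assumes dates start with YYYY
        groups[year].append(row)
    return reader.fieldnames, dict(groups)

def write_year_files(fieldnames, groups, prefix="subset_"):
    """Write one csv per year, e.g. subset_1919.csv (hypothetical naming)."""
    for year, rows in sorted(groups.items()):
        with open(f"{prefix}{year}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

# Invented sample rows standing in for an exported subset.
sample = io.StringIO(
    "date,text\n"
    "1919-03-01,first sample article\n"
    "1919-07-15,second sample article\n"
    "1920-01-10,third sample article\n"
)
fieldnames, groups = split_by_year(sample)
print({year: len(rows) for year, rows in groups.items()})
```

Calling `write_year_files(fieldnames, groups)` would then produce the per-year files that can be uploaded separately.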


Subset “scientific management” (ca. 1200 articles 1919-1939): waning identification with Taylor(ism)


Export of subset as a csv file


Getting concordances of a word in TextSTAT

Saving the concordance file and uploading it into Voyant for further analysis
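To make clear what a concordance is, here is a minimal keyword-in-context sketch in Python. It is not how TextSTAT works internally; the tokenization (lowercased word characters) and the sample sentence are assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=5):
    """Return each occurrence of keyword with `width` words of context per side."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

# Invented sample sentence, not taken from the newspaper corpus.
sample = "In Amerika is het stelsel van Taylor bekend. Amerika geldt als voorbeeld."
kwic = concordance(sample, "Amerika", width=2)
for line in kwic:
    print(line)
```

Joining the resulting lines into one text file gives the kind of input that can be loaded into a word cloud tool.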


Word cloud of concordances of “Amerika” in “scientific management” subset (1918-1939): words referring to other countries, to Taylor(ism), to work. Also note: “voorbeeld” (“model”), “oorlog” (“war”)

Constants of computational methods in historical research

Time factor

Comparative perspective

Focus on language

Capitalism ≠ “Capitalism”

Focus on language

Historical spelling variations

Ambiguity between word and concept

Ambiguity in word meaning

Overcoming language restrictions: dictionaries

Searching with large queries

Productie OR wetenschappelijk OR loon OR arbeid OR leiding OR systeem OR wetenschap OR taak OR methode OR stelsel OR studie OR kennis OR geschikt OR winst OR resultaten OR snelheid OR vermeerdering OR geld
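A long query like the one above is easier to maintain as a word list that is joined automatically. The sketch below assumes the search interface accepts terms separated by OR, as in the query on this slide; the function name is hypothetical.

```python
def build_or_query(words):
    """Join unique, non-empty words with OR, keeping first-seen order."""
    seen = []
    for word in words:
        word = word.strip()
        # Compare case-insensitively so "Productie" and "productie" count once.
        if word and word.lower() not in (s.lower() for s in seen):
            seen.append(word)
    return " OR ".join(seen)

# A few terms from the query above, with a duplicate and an empty entry added.
query = build_or_query(["Productie", "wetenschappelijk", "loon", "productie", ""])
print(query)
```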

But how to find the right words?





http://bit.do/CTM_Trier

Using topic models to find context-specific words

Topic modeling (sets of) texts

Use (interesting) output topic modeling as (combinations of) keywords for further exploration

‘Moderne productie-beginselen’, Nieuwe Tilburgsche Courant, 20-11-1924

Found for “product* machine* verspilling bedrijf goedkoop kwaliteit”, 1900-1909

Het nieuws van den dag voor Nederlandsch-Indië, 16-05-1903

Algemeen Handelsblad, 19-05-1906

Sub-corpus topic modeling

Topic modeling

Representing topics in collection of documents

Use statistics to find topics represented by groups of words

Document is a mix of topics

Topic is a mix of words

Documents and words can be directly observed, topics are latent

Topic modeling

Given a collection of documents, the modeling process does two things:

create word probability distribution for topics

create topic probability distribution for documents

Both are purely based on frequency and co-occurrence of words
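The two distributions can be tied together in one formula. The notation is an assumption introduced here and does not appear on the slides: K is the number of topics, w a word, d a document, and z the latent topic.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Read from right to left: a document is a mix of topics, p(z = k | d), and each topic is a mix of words, p(w | z = k). Only the left-hand side can be estimated directly from counts; the two factors on the right are what the model infers.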

Mallet LDA

Mallet uses Latent Dirichlet Allocation

Reducing high-dimensional term vector space to low-dimensional 'latent' topic space

Iterative sampling to establish topics, word-topic distribution and topic-document distribution

After a sufficient number of iterations, the distributions become stable
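To give a feel for what "iterative sampling" means, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch and not Mallet's implementation, which is considerably more optimized. The hyperparameter values (alpha, beta), the iteration count, and the tiny sample documents are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]            # tokens per topic, per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                          # total tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how common it is in this document
                # and how common this word is in that topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Invented mini-corpus, far too small for meaningful topics.
docs = [
    ["arbeid", "loon", "arbeid", "fabriek"],
    ["loon", "fabriek", "arbeid"],
    ["film", "muziek", "film", "radio"],
    ["radio", "muziek", "film"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(k, top)
```

The returned counts are the raw material for both distributions: normalizing a row of `doc_topic` gives a document's topic mix, and normalizing a `topic_word` entry gives a topic's word mix.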

1) Gathering a corpus



Mallet LDA

Project Gutenberg

https://www.gutenberg.org

Hathi Trust Digital Library

https://www.hathitrust.org

2) Topic modeling



Mallet LDA

Mallet command line

http://mallet.cs.umass.edu

Mallet GUI

https://code.google.com/p/topic-modeling-tool/

Mallet GUI settings

Stopword list

No. of iterations

No. of topics

No. of words per topic

(Splitting of corpus)
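The GUI settings above correspond to options of the Mallet command line. The sketch below only builds the two commands as argument lists; it does not run them. The flag names reflect my understanding of the Mallet documentation and should be checked against `bin/mallet train-topics --help`. The file names (corpus.mallet, topic_keys.txt, doc_topics.txt) and the default values are hypothetical.

```python
import shlex

def mallet_commands(corpus_dir, num_topics=20, iterations=1000, top_words=20,
                    stoplist=None, mallet="bin/mallet"):
    """Build the import and training commands for a directory of text files."""
    import_cmd = [mallet, "import-dir", "--input", corpus_dir,
                  "--output", "corpus.mallet", "--keep-sequence"]
    if stoplist:
        # A custom stopword list, one word per line.
        import_cmd += ["--stoplist-file", stoplist]
    else:
        # Mallet's built-in (English) stopword list.
        import_cmd += ["--remove-stopwords"]
    train_cmd = [mallet, "train-topics", "--input", "corpus.mallet",
                 "--num-topics", str(num_topics),
                 "--num-iterations", str(iterations),
                 "--num-top-words", str(top_words),
                 "--output-topic-keys", "topic_keys.txt",
                 "--output-doc-topics", "doc_topics.txt"]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("texts", num_topics=50, stoplist="stop_nl.txt")
print(shlex.join(import_cmd))
print(shlex.join(train_cmd))
```

The printed lines can be pasted into a terminal opened in the Mallet folder, or passed to `subprocess.run` once the paths are known to be right.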

Mallet GUI output

3) Further exploration



Use interesting words to build a dictionary
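Picking words from the topic model output can be partly automated. The sketch below assumes a topic-keys file in the tab-separated layout that Mallet's command line writes (topic number, a weight, then the top words separated by spaces); the sample content is invented, and the output of the GUI tool may be laid out differently.

```python
def parse_topic_keys(text):
    """Map each topic number to its list of top words."""
    topics = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topics[int(parts[0])] = parts[2].split()
    return topics

def build_dictionary(topics, chosen):
    """Collect the words of the chosen topics, without duplicates."""
    words = []
    for topic_id in chosen:
        for word in topics[topic_id]:
            if word not in words:
                words.append(word)
    return words

# Invented sample in the assumed topic-keys layout.
sample_keys = (
    "0\t0.25\tarbeid loon fabriek productie\n"
    "1\t0.25\tfilm muziek radio\n"
    "2\t0.25\tproductie machine kwaliteit\n"
)
topics = parse_topic_keys(sample_keys)
dictionary = build_dictionary(topics, chosen=[0, 2])
print(dictionary)
```

Which topics count as "interesting" remains a matter of reading and judgment; the code only handles the bookkeeping afterwards.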

Word use in specific corpora

Chronicling America: http://chroniclingamerica.loc.gov

Europeana Newspaper Project: http://www.theeuropeanlibrary.org/tel4/access

Dutch historical newspapers: http://www.delpher.nl

http://corpus.byu.edu

Word frequencies, collocations, etc.

AntConc: http://www.laurenceanthony.net/software.html

Taporware: http://taporware.ualberta.ca

TextSTAT: http://neon.niederlandistik.fu-berlin.de/en/textstat/

Word use over time

Google Books ngram viewer: https://books.google.com/ngrams

NY Times ngram viewer: http://chronicle.nytlabs.com

Chronicling America ngram viewer: http://bookworm.culturomics.org/ChronAm/
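For a corpus of your own, the kind of curve these viewers draw can be approximated with a relative frequency per year. The sketch below is a rough stand-in, not the method any of the tools above use; the input format (pairs of year and text), the per-1,000-tokens scale, and the sample texts are assumptions.

```python
import re
from collections import Counter

def relative_frequency_by_year(docs, word):
    """docs: iterable of (year, text). Returns occurrences per 1,000 tokens per year."""
    hits, totals = Counter(), Counter()
    target = word.lower()
    for year, text in docs:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Normalizing by the yearly token total keeps busy years from dominating.
    return {y: 1000 * hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented sample texts.
docs = [
    (1920, "amerika en taylor"),
    (1920, "het stelsel"),
    (1930, "amerika amerika werk nu"),
]
freq = relative_frequency_by_year(docs, "Amerika")
print(freq)
```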
