cultural text mining workshop
Cultural text mining
Pim Huijnen, Utrecht University
DH Autumn School @ Uni Trier, October 1, 2015
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignments: looking for the right words
www.translantis.nl
National Library The Hague
~9.000.000 digitized newspaper pages 1618 - 1995
Digital Humanities Approaches to Reference Cultures; The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990
“…uses digital technologies to analyze the role of reference cultures in debates about social issues and collective identities, looking specifically at the emergence of the United States in public discourse in the Netherlands from the end of the nineteenth century to the end of the Cold War.”
The United States as a reference culture
Business
Society
Consumption
Media
Crime
Health
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
Queries / title selection
Visibility of underlying data
Variable time frame
Export function (csv)
Linguistic and statistical settings
histogram
Word cloud
BILAND
Query: ‘Heredity’ (1876) (22/1465 hits)
BILAND
Query: ‘Heredity’ (1935) (1465 hits)
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
Greatest benefit of digital tools is in exploring data, not in providing evidence
Exploratory text mining
“[R]igorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, working with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims.”
Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).
Exploratory text mining
“In other words, data does not always have to be used as evidence, but can be simply for discovering and framing research questions. […] [P]laying with data – in all its formats and forms – is more important than ever.”
Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
Greatest benefit of digital tools is in exploring data, not in providing evidence
No one-size-fits-all solutions
Digital workflows (1)
Export of subset as a csv file
Stripping file of redundant metadata
Upload files into Voyant for further analysis
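The stripping step can be scripted. The following is a minimal sketch, assuming the exported csv has a header row and that the columns worth keeping are called `date` and `text`; both column names, like the sample rows, are hypothetical and should be replaced with whatever the actual export contains.

```python
import csv
import io

def strip_metadata(csv_text, keep=("date", "text")):
    """Return csv text that keeps only the columns named in `keep`.

    The column names are assumptions for illustration; check the header
    row of the actual export and adjust `keep` accordingly.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(keep), lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Copy only the wanted columns; missing ones become empty strings.
        writer.writerow({name: row.get(name, "") for name in keep})
    return out.getvalue()

# Hypothetical two-article export with extra metadata columns.
sample = (
    "id,paper,date,url,text\n"
    "1,Algemeen Handelsblad,1906-05-19,http://example.org/1,Eerste artikel\n"
    "2,Nieuwe Tilburgsche Courant,1924-11-20,http://example.org/2,Tweede artikel\n"
)
stripped = strip_metadata(sample)
print(stripped)
```

The result can be written to a file and uploaded into Voyant in the usual way.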
Digital workflows (2)
Export of subset as a csv file
Stripping file of redundant metadata
Splitting csv into csv-per-year
Upload files into Voyant for further analysis
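Splitting into csv-per-year can be sketched as below. This assumes a `date` column holding ISO-style dates (YYYY-MM-DD), so that the first four characters are the year; the column name and date format are assumptions, and an export with dd-mm-yyyy dates would need the last four characters instead.

```python
import csv
import io
from collections import defaultdict

def split_by_year(csv_text, date_column="date"):
    """Group rows by year and return a dict of file name -> csv text.

    Assumes dates start with a four-digit year (e.g. 1919-03-01).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows_per_year = defaultdict(list)
    for row in reader:
        rows_per_year[row[date_column][:4]].append(row)

    files = {}
    for year, rows in sorted(rows_per_year.items()):
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        files[f"{year}.csv"] = out.getvalue()
    return files

# Hypothetical sample rows.
sample = "date,text\n1919-03-01,a\n1919-07-12,b\n1920-01-05,c\n"
per_year = split_by_year(sample)
for name in per_year:
    print(name)
```

Each entry of `per_year` can then be saved under its file name and uploaded as a separate document, which lets Voyant compare the years with each other.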
Digital workflows (2)
Subset “scientific management” (ca. 1200 articles 1919-1939): waning identification with Taylor(ism)
Digital workflows (3)
Export of subset as csv file
Stripping file of redundant metadata
Getting concordances of a word in TextSTAT
Saving concordance file and uploading it into Voyant for further analysis
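What a concordance contains can be illustrated with a few lines of Python. This is a simplified sketch of keyword-in-context extraction, not TextSTAT's own procedure; the tokenisation (lowercased runs of word characters) and the sample sentence are invented for illustration.

```python
import re

def concordance(text, keyword, width=5):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Invented sample text.
sample = ("In Amerika geldt het Taylor-stelsel als voorbeeld. "
          "Ook na den oorlog kijkt men naar Amerika.")
hits = concordance(sample, "Amerika", width=2)
for line in hits:
    print(line)
```

Saving the resulting lines to a text file gives a concordance file of the kind that can be uploaded into Voyant.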
Digital workflows (3)
Word cloud of concordances of “Amerika” in “scientific management” subset (1918-1939): words referring to other countries, to Taylor(ism), to work. Also note: “voorbeeld” (“model”), “oorlog” (“war”)
Constants of computational methods in historical research
Time factor
Comparative perspective
Focus on language
Capitalism ≠ “Capitalism”
Focus on language
Historical spelling variations
Ambiguity between word and concept
Ambiguity in word meaning
Overcoming language restrictions: dictionaries
Searching with large queries
Productie OR wetenschappelijk OR loon OR arbeid OR leiding OR systeem OR wetenschap OR taak OR methode OR stelsel OR studie OR kennis OR geschikt OR winst OR resultaten OR snelheid OR vermeerdering OR geld
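A query of this size is easier to maintain as a word list that is joined automatically. A minimal sketch, assuming the search interface accepts terms separated by `OR` as in the query above; the helper name `build_or_query` is hypothetical.

```python
def build_or_query(words):
    """Join dictionary words into one OR query, dropping duplicates but keeping order."""
    seen = set()
    unique = []
    for word in words:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(word.strip())
    return " OR ".join(unique)

# A few of the words from the query above, with a duplicate and stray spaces.
dictionary = ["productie", "wetenschappelijk", "loon", "arbeid", "Loon", " leiding "]
query = build_or_query(dictionary)
print(query)
```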
But how to find the right words?
http://bit.do/CTM_Trier
Using topic models to find context-specific words
Topic modeling (sets of) texts
Use (interesting) output of topic modeling as (combinations of) keywords for further exploration
‘Moderne productie-beginselen’, Nieuwe Tilburgsche Courant, 20-11-1924
Found for “product* machine* verspilling bedrijf goedkoop kwaliteit”, 1900-1909
Het nieuws van den dag voor Nederlandsch-Indië, 16-05-1903
Algemeen Handelsblad, 19-05-1906
Sub-corpus topic modeling
Topic modeling
Representing topics in collection of documents
Use statistics to find topics represented by groups of words
Document is a mix of topics
Topic is a mix of words
Documents and words can be directly observed, topics are latent
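The two "mix" statements can be combined into one expression. The notation is not from the slides and is introduced here only for illustration: θ_{d,k} is the share of topic k in document d, φ_{k,w} is the probability of word w in topic k, and K is the number of topics.

```latex
P(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1 .
```

Only the left-hand side can be estimated directly from word counts; θ and φ are the latent quantities the model has to infer.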
Topic modeling
Given a collection of documents, the modeling process does two things:
Create word probability distribution for topics
Create topic probability distribution for documents
Both are purely based on frequency and co-occurrence of words
Mallet LDA
Mallet uses Latent Dirichlet Allocation
Reducing high-dimensional term vector space to low-dimensional 'latent' topic space
Iterative sampling to establish topics, word-topic distribution and topic-document distribution
After enough iterations, the distributions become stable
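The idea of iterative sampling can be made concrete with a toy example. The sketch below is a plain collapsed Gibbs sampler for LDA, written for readability; it is not Mallet's implementation, which is far more optimised. The function name, the hyperparameter values `alpha` and `beta`, and the four tiny documents are all assumptions for illustration.

```python
import random
from collections import Counter

def toy_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this document
                # and how often it already produces this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Normalise the smoothed counts into the two distributions.
    word_dist = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    topic_dist = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return word_dist, topic_dist

# Invented mini-corpus: two documents about wages, two about science.
docs = [
    "loon arbeid loon arbeid winst".split(),
    "arbeid loon winst loon".split(),
    "wetenschap studie kennis studie".split(),
    "kennis wetenschap studie kennis".split(),
]
word_dist, topic_dist = toy_lda(docs)
for k, dist in enumerate(word_dist):
    top = sorted(dist, key=dist.get, reverse=True)[:3]
    print(k, top)
```

`word_dist` is the word probability distribution per topic and `topic_dist` the topic probability distribution per document. Because the sampler is random, which topic ends up with which words depends on the seed.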
1) Gathering a corpus
Topic modeling (sets of) texts
Use (interesting) output of topic modeling as (combinations of) keywords for further exploration
Mallet LDA
Project Gutenberg
https://www.gutenberg.org
Hathi Trust Digital Library
https://www.hathitrust.org
2) Topic modeling
Topic modeling (sets of) texts
Use (interesting) output of topic modeling as (combinations of) keywords for further exploration
Mallet LDA
Mallet command line
http://mallet.cs.umass.edu
Mallet GUI
https://code.google.com/p/topic-modeling-tool/
Mallet GUI settings
Stopword list
No. of iterations
No. of topics
No. of words per topic
(Splitting of corpus)
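The same settings exist on the Mallet command line. The sketch below only builds the two invocations (import, then training) as argument lists and prints them; it does not run Mallet. The flags are those of Mallet's `import-dir` and `train-topics` commands as I understand them, so it is worth checking them against `bin/mallet train-topics --help` for the installed version. The paths, file names and default values are placeholders.

```python
import shlex

def mallet_commands(corpus_dir, stoplist, num_topics=20, num_iterations=1000,
                    num_top_words=20, mallet="bin/mallet"):
    """Build the two Mallet invocations (import, then train) as argument lists.

    All paths and output file names are placeholders for illustration.
    """
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,
        "--output", "corpus.mallet",
        "--keep-sequence",
        "--remove-stopwords",
        "--stoplist-file", stoplist,          # stopword list
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),          # no. of topics
        "--num-iterations", str(num_iterations),  # no. of iterations
        "--num-top-words", str(num_top_words),    # no. of words per topic
        "--output-topic-keys", "topic_keys.txt",
        "--output-doc-topics", "doc_topics.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("corpus/", "stopwords_nl.txt", num_topics=10)
print(shlex.join(import_cmd))
print(shlex.join(train_cmd))
```

To actually run them, each list can be passed to `subprocess.run(cmd, check=True)` from the directory where Mallet is installed.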
Mallet GUI output
3) Further exploration
Topic modeling (sets of) texts
Use (interesting) output of topic modeling as (combinations of) keywords for further exploration
Use interesting words to build a dictionary
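Building the dictionary from topic output can also be scripted. A minimal sketch, assuming a topic-keys file in the layout written by the Mallet command line (topic number, weight and the top words, separated by tabs); the GUI's output files may be laid out differently. The sample topics, the helper names and the choice of which topics count as interesting are invented.

```python
def parse_topic_keys(text):
    """Parse lines of the form: topic id <TAB> weight <TAB> space-separated words."""
    topics = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topics[int(parts[0])] = parts[2].split()
    return topics

def build_dictionary(topics, chosen_topics, exclude=()):
    """Collect the words of the chosen topics, minus words judged uninteresting."""
    words = []
    for topic_id in chosen_topics:
        for word in topics[topic_id]:
            if word not in exclude and word not in words:
                words.append(word)
    return words

# Invented topic-keys content with three topics.
sample = (
    "0\t0.25\tproductie arbeid loon winst\n"
    "1\t0.25\tstad weer nieuws\n"
    "2\t0.25\tarbeid wetenschap methode\n"
)
topics = parse_topic_keys(sample)
dictionary = build_dictionary(topics, chosen_topics=[0, 2], exclude={"winst"})
print(dictionary)
```

Which topics and words are interesting remains a judgment made by reading the output; the script only does the bookkeeping.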
Word use in specific corpora
Chronicling America
http://chroniclingamerica.loc.gov
Europeana Newspaper Project
http://www.theeuropeanlibrary.org/tel4/access
Dutch historical newspapers
http://www.delpher.nl
http://corpus.byu.edu
Word frequencies, collocations, etc.
AntConc
http://www.laurenceanthony.net/software.html
Taporware
http://taporware.ualberta.ca
TextSTAT
http://neon.niederlandistik.fu-berlin.de/en/textstat/
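For readers who want to see what these tools compute, word frequencies and simple collocations can be sketched in a few lines. This is a simplified illustration, not how any of the listed tools works internally: collocates are just counted within a fixed window, without any statistical association measure, and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def word_frequencies(text):
    """Count how often each token occurs."""
    return Counter(tokenize(text))

def collocates(text, keyword, window=2):
    """Count words appearing within `window` tokens of each occurrence of keyword."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text.
sample = "Het stelsel van Taylor. Het Taylor stelsel in Amerika. Amerika als voorbeeld."
freq = word_frequencies(sample)
near = collocates(sample, "taylor", window=1)
print(freq.most_common(3))
print(near.most_common(3))
```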
Word use over time
Google Books ngram viewer
https://books.google.com/ngrams
NY Times ngram viewer
http://chronicle.nytlabs.com
Chronicling America ngram viewer
http://bookworm.culturomics.org/ChronAm/