cultural text mining workshop

Cultural text mining Pim Huijnen Utrecht University DH Autumn School @ Uni Trier, October 1, 2015


Page 1: Cultural text mining workshop

Cultural text mining

Pim Huijnen, Utrecht University

DH Autumn School @ Uni Trier, October 1, 2015

Page 2: Cultural text mining workshop

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignments: looking for the right words

Page 3: Cultural text mining workshop

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignment: looking for the right words

Page 4: Cultural text mining workshop

www.translantis.nl


Page 5: Cultural text mining workshop

National Library The Hague

~9,000,000 digitized newspaper pages, 1618-1995


Page 6: Cultural text mining workshop

Digital Humanities Approaches to Reference Cultures; The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990

“…uses digital technologies to analyze the role of reference cultures in debates about social issues and collective identities, looking specifically at the emergence of the United States in public discourse in the Netherlands from the end of the nineteenth century to the end of the Cold War.”


Page 7: Cultural text mining workshop

The United States as a reference culture

Business

Society

Consumption

Media

Crime

Health

Page 8: Cultural text mining workshop

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignment: looking for the right words

Page 9: Cultural text mining workshop

Critical Digital Humanities

Transparency: the user has to understand the tools

Flexibility: constant to-and-fro between close and distant reading

Page 10: Cultural text mining workshop

Queries / title selection

Visibility of underlying data

Variable time frame

Export function (csv)

Linguistic and statistical settings

Histogram

Word cloud

Page 11: Cultural text mining workshop

BILAND

Query: ‘Heredity’ (1876) (22/1465 hits)

Page 12: Cultural text mining workshop

BILAND

Query: ‘Heredity’ (1935) (1465 hits)

Page 13: Cultural text mining workshop

Critical Digital Humanities

Transparency: the user has to understand the tools

Flexibility: constant to-and-fro between close and distant reading

The greatest benefit of digital tools is in exploring data, not in providing evidence

Page 14: Cultural text mining workshop

Exploratory text mining

[R]igorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, working with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims.

Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).

Page 15: Cultural text mining workshop

Exploratory text mining

In other words, data does not always have to be used as evidence, but can be simply for discovering and framing research questions. […] [P]laying with data – in all its formats and forms – is more important than ever.

Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).

Page 16: Cultural text mining workshop

Critical Digital Humanities

Transparency: the user has to understand the tools

Flexibility: constant to-and-fro between close and distant reading

The greatest benefit of digital tools is in exploring data, not in providing evidence

No one-size-fits-all solutions

Page 17: Cultural text mining workshop

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignment: looking for the right words

Page 18: Cultural text mining workshop

Digital workflows (1)

Export of subset as a csv file

Stripping file of redundant metadata

Upload files into Voyant for further analysis
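The "stripping" step can be sketched in Python. This is a minimal sketch, not the script used in the workshop: the slides do not show the layout of the exported csv, so the column names (`identifier`, `date`, `paper_title`, `url`, `text`) and the sample rows are hypothetical.

```python
import csv
import io

def strip_metadata(csv_text, keep=("date", "text")):
    """Return csv text that contains only the columns named in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(keep), lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in keep})
    return out.getvalue()

# Hypothetical export with two articles; a real export will look different.
sample = (
    "identifier,date,paper_title,url,text\n"
    "a1,1924-11-20,Example Courant,http://example.org/a1,eerste artikel\n"
    "a2,1925-01-03,Example Courant,http://example.org/a2,tweede artikel\n"
)
stripped = strip_metadata(sample)
print(stripped)
```

For a real export, the same function can be fed the contents of the csv file, and the result saved as a new file before uploading it into Voyant.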

Page 19: Cultural text mining workshop
Page 20: Cultural text mining workshop

Digital workflows (2)

Export of subset as a csv file

Stripping file of redundant metadata

Splitting csv into csv-per-year

Upload files into Voyant for further analysis
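Splitting by year might look like the sketch below. It assumes, hypothetically, a `date` column whose values start with a four-digit year and a `text` column; dates in another format (for instance day-month-year) would need a different slice. It builds one plain-text document per year rather than writing files, to keep the example self-contained.

```python
import csv
import io
from collections import defaultdict

def split_by_year(csv_text, date_column="date", text_column="text"):
    """Group article texts by the year at the start of the date field."""
    per_year = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row[date_column][:4]  # assumption: dates start with YYYY
        per_year[year].append(row[text_column])
    return dict(per_year)

# Invented sample rows, for illustration only.
sample = (
    "date,text\n"
    "1919-03-01,artikel over arbeid\n"
    "1919-07-15,artikel over loon\n"
    "1920-02-10,artikel over productie\n"
)
texts_per_year = split_by_year(sample)

# One document per year, ready to be saved and uploaded separately.
documents = {year: "\n".join(texts) for year, texts in sorted(texts_per_year.items())}
for year, doc in documents.items():
    print(year, len(doc.split()), "words")
```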

Page 21: Cultural text mining workshop

Digital workflows (2)

Subset “scientific management” (ca. 1200 articles 1919-1939): waning identification with Taylor(ism)

Page 22: Cultural text mining workshop

Digital workflows (3)

Export of subset as csv file

Stripping file of redundant metadata

Getting concordances of a word in TextSTAT

Saving the concordance file and uploading it into Voyant for further analysis
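What a concordance tool does can be illustrated with a few lines of Python. This is a simplified keyword-in-context sketch, not TextSTAT's implementation; the tokenizer, the window size and the sample sentence are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=5):
    """Return keyword-in-context lines: up to `width` words on either side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Invented sample sentence, for illustration only.
sample = "In Amerika is het stelsel van Taylor bekend. Ook buiten Amerika wint het terrein."
kwic = concordance(sample, "Amerika", width=3)
for line in kwic:
    print(line)
```

Joining such lines into one text file gives a document that contains only the immediate context of the keyword, which is what the word cloud on the next slide is built from.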

Page 23: Cultural text mining workshop

Digital workflows (3)

Word cloud of concordances of “Amerika” in “scientific management” subset (1918-1939): words referring to other countries, to Taylor(ism), to work. Also note: “voorbeeld” (“model”), “oorlog” (“war”)

Page 24: Cultural text mining workshop

Constants of computational methods in historical research

Time factor

Comparative perspective

Focus on language

Page 25: Cultural text mining workshop

Time factor

Page 26: Cultural text mining workshop

Comparative perspective

Page 27: Cultural text mining workshop

Focus on language

Capitalism ≠ “Capitalism”

Page 28: Cultural text mining workshop

Focus on language

Historical spelling variations

Ambiguity between word and concept

Ambiguity in word meaning

Page 29: Cultural text mining workshop

Overcoming language restrictions: dictionaries

Page 30: Cultural text mining workshop

Searching with large queries

Productie OR wetenschappelijk OR loon OR arbeid OR leiding OR systeem OR wetenschap OR taak OR methode OR stelsel OR studie OR kennis OR geschikt OR winst OR resultaten OR snelheid OR vermeerdering OR geld
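A query like this is easier to maintain as a word list from which the OR string is generated. The sketch below uses the first six terms of the query above; the helper names and the sample sentence are hypothetical, and the matching is a plain whole-word check rather than the search engine's own logic.

```python
import re

def build_or_query(terms):
    """Join search terms into a single OR query string."""
    return " OR ".join(terms)

def matching_terms(text, terms):
    """Return the dictionary terms that occur as whole words in `text`."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [term for term in terms if term.lower() in tokens]

dictionary = ["productie", "wetenschappelijk", "loon", "arbeid", "leiding", "systeem"]
query = build_or_query(dictionary)
print(query)

# Invented sample sentence, for illustration only.
hits = matching_terms("De arbeid en het loon in de fabriek", dictionary)
print(hits)
```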

Page 31: Cultural text mining workshop

But how to find the right words?

Page 32: Cultural text mining workshop

Translantis: goals & methods

Critical Digital Humanities

Cultural text mining: Workflows

Assignment: looking for the right words

http://bit.do/CTM_Trier

Page 33: Cultural text mining workshop

Using topic models to find context-specific words

Topic modeling (sets of) texts

Use (interesting) output of topic modeling as (combinations of) keywords for further exploration

Page 34: Cultural text mining workshop

‘Moderne productie-beginselen’, Nieuwe Tilburgsche Courant, 20-11-1924

Page 35: Cultural text mining workshop

Found for “product* machine* verspilling bedrijf goedkoop kwaliteit”, 1900-1909

Het nieuws van den dag voor Nederlandsch-Indië, 16-05-1903

Algemeen Handelsblad, 19-05-1906

Page 36: Cultural text mining workshop

Sub-corpus topic modeling

Page 37: Cultural text mining workshop

Topic modeling

Representing topics in collection of documents

Use statistics to find topics represented by groups of words

Document is a mix of topics

Topic is a mix of words

Documents and words can be directly observed, topics are latent

Page 38: Cultural text mining workshop

Topic modeling

Given a collection of documents, the modeling process does two things:

Create a word probability distribution for each topic

Create a topic probability distribution for each document

Both are purely based on frequency and co-occurrence of words
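How the two distributions combine can be shown with a toy calculation: the probability of a word in a document is the sum, over all topics, of the topic's share in the document times the word's probability in that topic. The topic names and all numbers below are invented for illustration.

```python
# Toy distributions, invented for illustration only.
topic_word = {
    "work":    {"arbeid": 0.5, "loon": 0.4, "amerika": 0.1},
    "country": {"arbeid": 0.1, "loon": 0.1, "amerika": 0.8},
}
doc_topic = {"work": 0.75, "country": 0.25}

def word_probability(word, doc_topic, topic_word):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(share * topic_word[topic].get(word, 0.0)
               for topic, share in doc_topic.items())

probs = {w: word_probability(w, doc_topic, topic_word)
         for w in ("arbeid", "loon", "amerika")}
print(probs)
```

A document that is mostly about "work" still assigns some probability to "amerika", because the "country" topic contributes a quarter of its words.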

Page 39: Cultural text mining workshop

Mallet LDA

Mallet uses Latent Dirichlet Allocation

Reducing high-dimensional term vector space to low-dimensional 'latent' topic space

Iterative sampling to establish topics, word-topic distribution and topic-document distribution

After enough iterations, the distributions stabilize
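The idea of iterative sampling can be made concrete with a small collapsed Gibbs sampler for LDA. This is a teaching sketch under simplifying assumptions (fixed hyperparameters `alpha` and `beta`, pre-tokenized input, no optimizations); it is not Mallet's implementation, which is considerably more elaborate. The sample documents are invented.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Invented mini-corpus, for illustration only.
docs = [
    "arbeid loon arbeid fabriek loon".split(),
    "amerika engeland oorlog amerika".split(),
    "arbeid amerika loon fabriek".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

Normalizing the rows of `doc_topic` and `topic_word` (after adding `alpha` and `beta`) gives the topic-document and word-topic distributions mentioned above.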

Page 40: Cultural text mining workshop

1) Gathering a corpus

Topic modeling (sets of) texts

Use (interesting) output of topic modeling as (combinations of) keywords for further exploration

Page 41: Cultural text mining workshop

Mallet LDA

Project Gutenberg

https://www.gutenberg.org

Hathi Trust Digital Library

https://www.hathitrust.org

Page 42: Cultural text mining workshop

2) Topic modeling

Topic modeling (sets of) texts

Use (interesting) output of topic modeling as (combinations of) keywords for further exploration

Page 43: Cultural text mining workshop

Mallet LDA

Mallet command line

http://mallet.cs.umass.edu

Mallet GUI

https://code.google.com/p/topic-modeling-tool/

Page 44: Cultural text mining workshop

Mallet GUI settings

Stopword list

No. of iterations

No. of topics

No. of words per topic

(Splitting of corpus)
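The same settings exist on the Mallet command line. The sketch below only assembles the two argument lists (import, then training) and does not run anything. The path `bin/mallet` and the file names `corpus.mallet`, `topic_keys.txt` and `doc_topics.txt` are assumptions for the example; the option names are those of Mallet's `import-dir` and `train-topics` commands and are best checked against the installed version.

```python
def mallet_commands(corpus_dir, stoplist, num_topics, num_iterations, num_top_words):
    """Return (import_cmd, train_cmd) argument lists for the Mallet command line."""
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", corpus_dir,
        "--output", "corpus.mallet",      # hypothetical file name
        "--keep-sequence",                # word order is needed for topic modeling
        "--stoplist-file", stoplist,      # custom stopword list, one word per line
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", "topic_keys.txt",   # hypothetical file name
        "--output-doc-topics", "doc_topics.txt",   # hypothetical file name
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands(
    corpus_dir="corpus", stoplist="stopwords_nl.txt",
    num_topics=20, num_iterations=1000, num_top_words=15,
)
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the Mallet installation directory.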

Page 45: Cultural text mining workshop

Mallet GUI output

Page 46: Cultural text mining workshop

3) Further exploration

Topic modeling (sets of) texts

Use (interesting) output of topic modeling as (combinations of) keywords for further exploration

Page 47: Cultural text mining workshop

Use interesting words to build a dictionary
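Building a dictionary from topic-model output could be sketched as follows. The sketch assumes output in the shape of Mallet's topic-keys file (topic number, a weight, and the top words, separated by tabs); the sample lines and the choice of which topics count as "interesting" are invented.

```python
def read_topic_keys(lines):
    """Parse topic-keys lines: topic id, weight, space-separated top words."""
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, _weight, words = line.split("\t")
        topics[int(topic_id)] = words.split()
    return topics

def build_dictionary(topics, selected_ids):
    """Merge the words of the selected topics, keeping first-seen order."""
    dictionary = []
    for topic_id in selected_ids:
        for word in topics[topic_id]:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

# Invented sample output, for illustration only.
sample_keys = [
    "0\t0.5\tproductie machine arbeid loon",
    "1\t0.5\tamerika engeland duitschland oorlog",
    "2\t0.5\tarbeid leiding systeem methode",
]
topics = read_topic_keys(sample_keys)
dictionary = build_dictionary(topics, selected_ids=[0, 2])
print(dictionary)
print(" OR ".join(dictionary))
```

Selecting the topics remains a judgment call by the researcher; the code only merges the chosen word lists.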

Page 48: Cultural text mining workshop

Word use in specific corpora

Chronicling America: http://chroniclingamerica.loc.gov

Europeana Newspaper Project: http://www.theeuropeanlibrary.org/tel4/access

Dutch historical newspapers: http://www.delpher.nl

http://corpus.byu.edu

Page 49: Cultural text mining workshop

Word frequencies, collocations, etc.

AntConc: http://www.laurenceanthony.net/software.html

Taporware: http://taporware.ualberta.ca

TextSTAT: http://neon.niederlandistik.fu-berlin.de/en/textstat/
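What these tools compute can be approximated in a few lines of Python. The sketch below counts word frequencies and window-based collocates; it is a simplified illustration with an invented sample sentence, without the statistical association measures that dedicated tools offer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample sentence, for illustration only.
text = "scientific management came from America and scientific management spread to Europe"
tokens = tokenize(text)
frequencies = Counter(tokens)
neighbours = collocates(tokens, "management", window=1)
print(frequencies.most_common(3))
print(neighbours.most_common())
```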

Page 50: Cultural text mining workshop

Word use over time

Google Books ngram viewer: https://books.google.com/ngrams

NY Times ngram viewer: http://chronicle.nytlabs.com

Chronicling America ngram viewer: http://bookworm.culturomics.org/ChronAm/
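The principle behind such viewers is a relative frequency per time slice. A minimal sketch, assuming the corpus is available as (year, text) pairs; the sample articles are invented and the scaling per 1,000 tokens is an arbitrary choice for the example.

```python
import re
from collections import Counter

def relative_frequency_per_year(articles, term):
    """Return {year: occurrences of `term` per 1,000 tokens} for (year, text) pairs."""
    hits = Counter()
    totals = Counter()
    target = term.lower()
    for year, text in articles:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: 1000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}

# Invented sample articles, for illustration only.
articles = [
    (1919, "Taylor en het Taylor stelsel"),
    (1919, "arbeid en loon"),
    (1939, "wetenschappelijke bedrijfsleiding zonder Taylor in het bedrijf"),
]
trend = relative_frequency_per_year(articles, "Taylor")
print(trend)
```

Normalizing by the number of tokens per year matters because the amount of digitized text differs strongly between years; raw counts would mostly reflect corpus size.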

Page 51: Cultural text mining workshop

Word use over time