DiaView: Visualise Cultural Change in Diachronic Corpora, David Beavan, UCLDH, DH2012

Post on 22-Nov-2014


DESCRIPTION

Talk given at Digital Humanities 2012 (DH2012) in Hamburg, Germany on 18 July 2012.

Web site: http://www.scottishcorpus.ac.uk/corpus/diaview/
Video: http://lecture2go.uni-hamburg.de/konferenzen/-/k/13916
Abstract: http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/diaview-visualise-cultural-change-in-diachronic-corpora/

This paper will introduce and demonstrate DiaView, a new tool to investigate and visualise word usage in diachronic corpora. DiaView highlights cultural change over time by exposing salient lexical items from each decade or year, and providing them to the user in an effortless visualisation. This is made possible by examining large quantities of diachronic textual data, in this case the Google Books corpus (Michel et al. 2010) of one million English books. This paper will introduce the methods and technologies at its core, perform a demonstration of the tool and discuss further possibilities.

TRANSCRIPT

DiaView: Visualise Cultural Change in Diachronic Corpora

David Beavan
UCL Centre for Digital Humanities

@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview

Google Books corpus/Ngram Viewer

http://books.google.com/ngrams/

Google Books corpus

• OCR quality variable, particularly poor in 1700s (difficulties with long-s: ſ)

• Does not evenly sample across genres (data collection fairly opportunistic)

• Chronological placement questionable (implicit metadata not always correct)

• Very large data set (155 billion tokens)

DiaView uses

• English One Million corpus (“Books with low OCR quality were removed, and serials were removed.”)

• 1850 to present (avoids long-s)

• 98 billion tokens (still very large)

• Filter out very infrequently used words (or keep large sample of most frequently used)
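
The last bullet describes two ways to trim the vocabulary: drop rare words, or keep only the most frequent ones. A minimal sketch of both, assuming the corpus has already been reduced to a word-to-count mapping. The function name, its parameters and the toy counts are hypothetical and are not taken from the talk.

```python
from collections import Counter


def keep_most_frequent(counts, top_n=None, min_count=1):
    """Return the words that pass the frequency filter.

    counts    -- mapping of word -> total occurrences in the corpus
    top_n     -- if given, keep only the top_n most frequent words
    min_count -- drop words occurring fewer than min_count times
    """
    # most_common(None) returns every word, ordered by descending count.
    ranked = Counter(counts).most_common(top_n)
    return {word: n for word, n in ranked if n >= min_count}


# Toy counts, invented for illustration only.
toy_counts = {"and": 100, "sausage": 20, "zyzzyva": 1}
filtered = keep_most_frequent(toy_counts, min_count=5)
```

Either threshold could be used on its own; which one DiaView applies, and at what value, is not stated in the slides.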

DiaView concept

• Quick and easy to use
• Aggregate and summarise data
• Promote browsing and opportunistic discovery
• Help identify cultural trends across time
• Highlight salient or ‘interesting’ terms
• Provide links to more in-depth analysis
• Inspect corpus by decade or year
• Ability to work with any corpora or any dataset

DiaView method/measuring salience

Proportion of term occurrences in entire corpus

vs

Proportion of term occurrences in particular year
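
One way to write this comparison as a formula, read off the percentages used in the talk's worked figures. The symbols are assumptions introduced here, not notation from the slides: c_w is the number of occurrences of word w in the entire corpus, N the total number of tokens in the corpus, and c_{w,y} and N_y the same quantities restricted to year y.

```latex
p_w = \frac{c_w}{N}, \qquad
p_{w,y} = \frac{c_{w,y}}{N_y}, \qquad
s_{w,y} = \frac{p_{w,y} - p_w}{p_w}
```

Under this reading, a positive s_{w,y} marks overuse of w in year y relative to the corpus as a whole, and a negative value marks underuse.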

Word ‘and’

100 of 1000 words in entire corpus is ‘and’ = 10%

Year 1: 45 of 500 words = 9% = -10% relative to corpus proportion (10%)
Year 2: 55 of 500 words = 11% = +10% relative to corpus proportion (10%)

Word ‘sausage’

20 of 1000 words in entire corpus is ‘sausage’ = 2%

Year 1: 4 of 500 words = 0.8% = -60% relative to corpus proportion (2%)
Year 2: 16 of 500 words = 3.2% = +60% relative to corpus proportion (2%)

Rank for salience by year, ignoring underuse (negative percentages are discarded)

Year 1: (none)
Year 2: ‘sausage’ (+60%), ‘and’ (+10%)
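
The worked example can be reproduced in a few lines. This is a sketch under one assumption: that salience is the relative difference between a word's proportion in a year and its proportion in the whole corpus, which is how the slide percentages appear to be computed. The counts are the ones on the slides; the function and variable names are hypothetical.

```python
def salience(year_count, year_total, corpus_count, corpus_total):
    """Relative over- or under-use of a term in one year, as a fraction."""
    corpus_prop = corpus_count / corpus_total
    year_prop = year_count / year_total
    return (year_prop - corpus_prop) / corpus_prop


# Counts from the slides: occurrences per year, 500 words in each year.
year_counts = {
    "Year 1": {"and": 45, "sausage": 4},
    "Year 2": {"and": 55, "sausage": 16},
}
year_totals = {"Year 1": 500, "Year 2": 500}
corpus_counts = {"and": 100, "sausage": 20}
corpus_total = 1000


def rank_overused(year):
    """Words overused in `year`, most salient first; underuse is dropped."""
    scores = {
        word: salience(n, year_totals[year], corpus_counts[word], corpus_total)
        for word, n in year_counts[year].items()
    }
    overused = [(w, s) for w, s in scores.items() if s > 0]
    return sorted(overused, key=lambda pair: pair[1], reverse=True)


for year in year_counts:
    print(year, [(w, f"{s:+.0%}") for w, s in rank_overused(year)])
```

Year 1 yields an empty list because both words are underused there, while Year 2 ranks ‘sausage’ ahead of ‘and’ even though ‘and’ is far more frequent in absolute terms.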

DiaView method

• Word frequency alone does not dictate salience (extraordinary overuse does)

• Traverse entire corpus by year/decade
• Calculate salience for each type
• Rank types according to salience
• Apply visual style to word lists
• Create links back to Ngram Viewer for in-depth analysis
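
The steps listed above could be strung together roughly as follows. This is a sketch, not DiaView's actual implementation. The input format (year, word, count), the function names, the toy data, and in particular the `graph` path and query parameter names used to build the Ngram Viewer link are all assumptions made for illustration. The visual styling step is omitted.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Assumption: the path and parameter names are illustrative, not from the talk.
NGRAM_VIEWER = "http://books.google.com/ngrams/graph"


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1925 -> 1920."""
    return year - year % 10


def salient_by_decade(records, top_n=3):
    """Rank overused words per decade.

    records -- iterable of (year, word, count) tuples (hypothetical format)
    """
    corpus = defaultdict(int)                           # word -> corpus count
    per_decade = defaultdict(lambda: defaultdict(int))  # decade -> word -> count
    for year, word, count in records:
        corpus[word] += count
        per_decade[decade_of(year)][word] += count
    corpus_total = sum(corpus.values())

    result = {}
    for decade, counts in sorted(per_decade.items()):
        decade_total = sum(counts.values())
        scored = []
        for word, n in counts.items():
            corpus_prop = corpus[word] / corpus_total
            score = (n / decade_total - corpus_prop) / corpus_prop
            if score > 0:  # ignore underuse
                scored.append((word, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[decade] = scored[:top_n]
    return result


def ngram_link(word, decade):
    """Build a link for in-depth analysis of one word in one decade."""
    query = urlencode(
        {"content": word, "year_start": decade, "year_end": decade + 9}
    )
    return f"{NGRAM_VIEWER}?{query}"


# Toy records, invented for illustration only.
toy_records = [
    (1914, "war", 30), (1914, "and", 70),
    (1925, "war", 5), (1925, "jazz", 20), (1925, "and", 75),
]
result = salient_by_decade(toy_records)
```

Grouping by `decade_of` rather than by raw year is what switches the view between the decade and year granularities mentioned in the talk.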

www.scottishcorpus.ac.uk/corpus/diaview


