tools and methods for processing and visualizing large corpora · 2016. 1. 21. · tools and...
Tools and Methods for Processing and Visualizing
Large Corpora
Gerold Schneider, University of Zurich and University of Konstanz
Mennatallah El-Assady, University of Konstanz
Hans Martin Lehmann, University of Zurich
d2e Conference, Helsinki
Overview and Contents
Gerold Schneider, Menna el Assady, Hans Martin Lehmann Page 2
Several approaches and methods which we develop or use to create workflows from data to evidence.
1. Do-it-yourself from words via numbers to trends, as we use and develop them at the English Department in Zurich
• Searching specific instances: Dependency Bank
• Data-Driven Overuse: Lightside
• Visualize your trends: GoogleViz
2. Interactive Visualizations, as we use and develop them at the Data Analysis and Visualization Group at the University of Konstanz
• Topic Matrix View
• Statistical Visualizations: Tableau
• Lexical Episode Plot
1.1 DIY d2e: Dependency Bank
• Search specific instances & sum by period/genre/etc.: Dependency Bank e.g. ‘education system’ vs. ‘system of education’
1.1 DIY d2e: Dependency Bank
‘education system’ vs. ‘system of education’: trend confirmed by Google Books.
Noun-noun compounds are generally increasing.
Relative Frequency of open-form noun-noun compounds
University of Zurich, English Department, Hans Martin Lehmann Page 5
confirms and extends Leech et al. 2009
Syntactic query: stative verbs in the progressive
“I’m loving it” in the BNC
1.2 DIY d2e: Data-driven Overuse
Machine Learning for the Masses: Lightside
http://lightsidelabs.com/what/research/
Lightside allows you to do machine learning without programming skills.
E.g. can we classify American speeches as Republican or Democrat? What are their typical linguistic features?
We use the CORPS II corpus: 8 million words, 3,618 speeches (Guerini et al. 2013).
To do document classification with Lightside you must
• Create TAB-separated input.
• Understand the broad idea of Naïve Bayes or regression, just because
  • you need to interpret the results
  • you don’t want to crash your computer
• Be patient during the massive calculations
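As a minimal sketch of the first step, the following Python snippet writes labelled speeches as TAB-separated lines. The column names (`party`, `text`) and the two example rows are assumptions made for illustration; they are not taken from CORPS II or from the Lightside documentation.

```python
import csv
import io

def write_tsv(rows, handle):
    """Write (label, text) pairs as TAB-separated lines with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["party", "text"])  # hypothetical column names
    for label, text in rows:
        # Tabs and line breaks inside a text would break the column layout,
        # so collapse all whitespace to single spaces.
        writer.writerow([label, " ".join(text.split())])

# Invented example rows, not real corpus data.
speeches = [
    ("republican", "We will cut\ttaxes.\nWe will create jobs."),
    ("democrat", "Health care is a right."),
]

buffer = io.StringIO()
write_tsv(speeches, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, pass a file opened with `open("speeches.tsv", "w", newline="", encoding="utf-8")` (the file name is again only an example).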
1.2 DIY d2e: Data-driven Overuse: Lightside
Automated Media Content Analysis, Gerold Schneider Page 9
1.2 DIY d2e: Data-driven Overuse: Lightside
• you need to interpret the results
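To illustrate what interpreting the results can involve, here is a small sketch of multinomial Naïve Bayes with add-one smoothing in plain Python. It is not Lightside's implementation, and the four mini-speeches are invented. The function `typical_features` ranks words by how much more likely they are in one class than in the other, which is one possible way to read off "typical linguistic features".

```python
import math
from collections import Counter

# Invented toy training data, not taken from CORPS II.
train = [
    ("republican", "tax cuts freedom tax"),
    ("republican", "freedom security tax"),
    ("democrat", "health care jobs health"),
    ("democrat", "jobs education health"),
]

def fit(docs):
    """Count words per class and documents per class."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def log_p_word(word, label, word_counts, vocab):
    """log p(word | class) with add-one (Laplace) smoothing."""
    counts = word_counts[label]
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def classify(text, word_counts, doc_counts, vocab):
    """Return the class with the highest log prior plus summed log likelihoods."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += log_p_word(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)

def typical_features(label, other, word_counts, vocab, n=3):
    """Words ranked by log p(word | label) - log p(word | other)."""
    def ratio(w):
        return (log_p_word(w, label, word_counts, vocab)
                - log_p_word(w, other, word_counts, vocab))
    return sorted(vocab, key=lambda w: (-ratio(w), w))[:n]

word_counts, doc_counts, vocab = fit(train)
predicted = classify("tax freedom", word_counts, doc_counts, vocab)
top_rep = typical_features("republican", "democrat", word_counts, vocab)
print(predicted, top_rep)
```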
1.3 DIY d2e: Google Visualization
• Back to noun-noun compounds: Relative frequency of the Alternation
1.3 DIY d2e: Google Visualization
Back to noun-noun compounds: relative frequency of the alternation in COHA
You need basic programming skills in R.
Google Viz is an R library (the googleVis package).
There are excellent instructions by Martin Hilpert: thanks!
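Before plotting, the relative frequency of the alternation has to be computed per period. The sketch below shows that step in Python rather than R; the counts are invented placeholders, not COHA figures, and the function name is hypothetical.

```python
def alternation_share(counts):
    """Share of the noun-noun variant among both variants, per period.

    counts maps period -> (noun_noun_count, of_phrase_count).
    """
    shares = {}
    for period, (nn, of_phrase) in sorted(counts.items()):
        total = nn + of_phrase
        # Avoid dividing by zero for periods without any hits.
        shares[period] = nn / total if total else None
    return shares

# Invented placeholder counts for 'education system' vs. 'system of education'.
toy_counts = {
    1900: (10, 30),
    1950: (30, 30),
    2000: (60, 20),
}

shares = alternation_share(toy_counts)
for period, share in shares.items():
    print(period, share)
```

The resulting period-to-share table is the kind of input a charting library expects.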
2. Interactive Visualizations
We use and develop Interactive Visualizations at the Data Analysis and Visualization Group at the University of Konstanz.
Visualization for Digital Humanities
Digital Humanities projects are a main driving force for visualization in linguistics. Visualization is needed because the massive amounts of information cannot easily be viewed or understood using the traditional method of reading the texts.
Idea: Spot concepts, and zoom in to read the interesting parts.
And how do we spot concepts? By looking at keywords, such as words that are overused in particular documents/sections/paragraphs.
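One simple way to operationalise "overused" is to compare a word's relative frequency in a section with its relative frequency in the rest of the corpus. The sketch below uses a smoothed log ratio; the slides do not say which keyness measure is used, so this choice, the function name and the toy sentences are assumptions.

```python
import math
from collections import Counter

def overuse_scores(target_tokens, reference_tokens):
    """Log2 ratio of smoothed relative frequencies: target vs. reference."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    vocab = set(target) | set(reference)
    # Add-one smoothing keeps words absent from the reference finite.
    n_target = len(target_tokens) + len(vocab)
    n_reference = len(reference_tokens) + len(vocab)
    scores = {}
    for word in target:
        p_target = (target[word] + 1) / n_target
        p_reference = (reference[word] + 1) / n_reference
        scores[word] = math.log2(p_target / p_reference)
    return scores

# Invented toy texts.
section = "the moon mission reached the moon".split()
rest = "the senate debated the tax bill and the budget".split()

scores = overuse_scores(section, rest)
keywords = sorted(scores, key=lambda w: (-scores[w], w))
print(keywords[:2])
```

Positive scores mark words that are overused in the section, negative scores words that are underused.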
Unfortunately, the mapping from words to concepts is not 1:1
• The same word can mean many things
• Different words can refer to the same concept
• Meanings change over time (Tony McEnery’s plenary)
• Proper names are often just actors and witnesses in bigger concepts.
Vita brevis, arma longa
Noun-noun compounds suffer less from these problems; they denote (new) concepts.
2.1 Firthian Hypothesis and Topic Models
Unfortunately, the mapping from words to concepts is not 1:1
• The same word can mean many things: but contexts disambiguate
• Different words can refer to the same concept: but only if they are in similar contexts
• Word senses change over time: need data-driven, context-based methods
“words with similar distributional properties have similar meanings” (Sahlgren, 2006: 21)
• Words in immediate context → collocations (syntagmatic axis)
• Words in larger context → semantic associations, topics (paradigmatic axis)
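The distributional idea can be sketched with context-count vectors and cosine similarity: words that occur in similar contexts receive similar vectors. The window size and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented toy sentences.
toy = [
    "the rocket reached the moon",
    "the spacecraft reached the moon",
    "the senate passed the bill",
]
vectors = context_vectors(toy)
sim_space = cosine(vectors["rocket"], vectors["spacecraft"])
sim_mixed = cosine(vectors["rocket"], vectors["senate"])
print(round(sim_space, 2), round(sim_mixed, 2))
```

With raw counts, function words such as "the" inflate every similarity, so a real application would typically weight or filter the context words.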
Topics are clusters of words that frequently co-occur.
Many approaches to topic modelling (e.g. LDA) use a probability distribution model maximizing
p(topic | document) ⋅ p(word | topic)
We apply a deterministic topic model approach (IHTM) to 60,000 news articles from 1860 to 2000 (COHA).
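The product p(topic | document) ⋅ p(word | topic) can be made concrete with toy numbers. In mixture-style probabilistic topic models, the probability of a word in a document is this product summed over all topics. The sketch below only illustrates that arithmetic with invented distributions; it does not fit a model and is not the IHTM approach.

```python
# Invented toy distributions; each distribution sums to 1.
p_topic_given_doc = {"space": 0.7, "politics": 0.3}
p_word_given_topic = {
    "space": {"moon": 0.5, "orbit": 0.4, "vote": 0.1},
    "politics": {"moon": 0.0, "orbit": 0.1, "vote": 0.9},
}

def word_probability(word, topic_dist, word_dists):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        p_topic * word_dists[topic].get(word, 0.0)
        for topic, p_topic in topic_dist.items()
    )

def best_topic(word, topic_dist, word_dists):
    """Topic with the largest single product for this word in this document."""
    return max(
        topic_dist,
        key=lambda t: topic_dist[t] * word_dists[t].get(word, 0.0),
    )

p_moon = word_probability("moon", p_topic_given_doc, p_word_given_topic)
p_vote = word_probability("vote", p_topic_given_doc, p_word_given_topic)
vote_topic = best_topic("vote", p_topic_given_doc, p_word_given_topic)
print(p_moon, p_vote, vote_topic)
```

Even in a document that is mostly about space, "vote" is best explained by the politics topic, because p(word | topic) outweighs the smaller p(topic | document).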
2.1 Topic Matrix View
The First Moon Missions
lunar module moon orbit command flight space land surface spacecraft rocket craft mission ship walk
2.1 Selected topics and their keywords
Topics of War and Peace
2.2 Topic Evolution over Time
2.3 Lexical Episode Plots
Distant Reading
Zooming and Highlighting
Close Reading
[Figure: word occurrences at Index 4, Index 17, Index 23 and Index 94. Panels: Actual Distribution; Equidistance Distribution (25).]
Lexical Episodes
Lexical episode = a portion within the word sequence of a corpus where a certain word appears more densely than expected from its frequency in the whole text.
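The definition can be sketched as follows: compare the gaps between successive occurrences of a word with the gap expected if the occurrences were spread evenly over the text. This is a simplified illustration, not the authors' algorithm. The positions 4, 17, 23 and 94 are the index values shown on the slides; reading 100 as the text length, the strict "smaller than the expected gap" threshold and the minimum of three occurrences are assumptions.

```python
def lexical_episodes(positions, text_length, min_occurrences=3):
    """Group occurrences whose gaps are smaller than the equidistant gap.

    positions: sorted token indices at which the word occurs.
    Returns a list of (start_index, end_index) pairs.
    """
    if not positions:
        return []
    # Gap between occurrences if they were evenly spread over the text.
    expected_gap = text_length / len(positions)
    episodes = []
    run = [positions[0]]
    for previous, current in zip(positions, positions[1:]):
        if current - previous < expected_gap:
            run.append(current)  # still denser than expected
        else:
            if len(run) >= min_occurrences:
                episodes.append((run[0], run[-1]))
            run = [current]  # start a new candidate episode
    if len(run) >= min_occurrences:
        episodes.append((run[0], run[-1]))
    return episodes

episodes = lexical_episodes([4, 17, 23, 94], text_length=100)
print(episodes)
```

Here the expected gap is 100 / 4 = 25, so the occurrences at 4, 17 and 23 form one episode, while the isolated occurrence at 94 does not.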
[Figure: Lexical Episodes; an episode marked on a word sequence running to index 100.]
• Prohibition in the United States was a nationwide constitutional ban on the sale, production, importation, and transportation of alcoholic beverages that remained in place from 1920 to 1933.
• https://en.wikipedia.org/wiki/Unemployment_and_Farm_Relief_Act (1930); some pointers also to the Great Depression
• radio broadcast: peak in the 1930s
• Hydrogen bomb: one small peak in 1945
• world peace: from 1945 on
• tax income revenue, late 1940s: something happened there.
• government coalition, early 1950s: https://en.wikipedia.org/wiki/Attlee_ministry ??
• Sports: 1990s
• TV shows: 1990s