Tracking the Emergence of New Words across Time and Space
Jack Grieve, Aston University
Research conducted with Diansheng Guo & Alice Kasakoff, University of South Carolina, and Andrea Nini, Aston University
Funded as part of the Digging into Data Challenge
Approaches to Historical Linguistics
There are several different approaches to the analysis of
language change:
Reconstruction through comparison of known languages (comparative method)
Analysis of previous linguistic research (e.g. lexicographic research)
Analysis of historical texts (corpus-based)
Apparent time studies with interview data (sociolinguistics)
Computer simulations
Lexical Change
Research in historical linguistics and etymology has
analysed how the usage of certain words has changed
over relatively long periods of time (primarily based on
historical corpora and lexicographic research), but overall
there are large gaps in our knowledge of lexical change,
including how newly emerging words enter a language
and spread across its speakers.
Words are Rare Events
The main problem with studying lexical variation and
change is that most words are incredibly rare, thus
requiring incredibly large corpora of natural language.
This is why most research on lexical variation and
change has focused on relatively high frequency words,
primarily function words (e.g. pronouns, prepositions,
auxiliary verbs).
Word Frequency Distribution (Zipf 1935, 1945)
The majority of the 67,000 most frequent words in our corpus occur less than once per 25 million words
New Words are Incredibly Rare Events
The analysis of new words requires even more data,
because emerging words are by definition especially
rare.
In addition, to analyse the temporal and spatial spread
of new words, large corpora must be compiled for a
large number of points in time and locations.
Big Data
Suitable data has recently become available with the
rise of social media and smartphones, which
provide massive amounts of time-stamped and
geocoded natural language data.
Goals of Today’s Talk
Identify emerging words from 2014 based on a multi-
billion word corpus of American tweets.
Chart their usage over time and identify common
temporal patterns of lexical spread.
Map their geographical diffusion and identify common
spatial patterns of lexical spread.
The Corpus
Since 2013, the team at USC have been compiling two
multi-billion word geocoded corpora for the US and the UK
using the Twitter API.
Twitter is a particularly rich source of geocoded data and
is also very popular, informal, and youthful, making it ideal
for tracking the emergence of new words.
Approximately 2% of tweets are geocoded.
The Corpus
The analysis today is based on an 8.9 billion word
corpus of American Tweets from October 2013 to
November 2014, which totals approximately 980 million
Tweets from 7 million users.
Every tweet is geocoded with the precise longitude and
latitude of the user when posting, which were then used
to identify the county where each Tweet was produced.
-87.684555,42.074043 Just posted a photo @ Baha'i House of Worship
Corpus Examples
username,fips,time,tweet
-,48439,Sun Jul 27 23:59:59 EDT 2014,don't follow the right ppl lol
-,42007,Sun Jul 27 23:59:59 EDT 2014,yesss moody judy
-,36005,Sun Jul 27 23:59:59 EDT 2014,Man i was just thinking shexx be lurking but won't hmu
-,25021,Sun Jul 27 23:59:59 EDT 2014,no seeing u on tv is reel but not seeing u on twitter is real for me...so pls visit us here everyday.
-,26163,Sun Jul 27 23:59:59 EDT 2014,Hate seeing my friends sad
-,12093,Sun Jul 27 23:59:59 EDT 2014,this is the shirt i won that i got to sign btw!!:)
Graveyard/Cemetery
Graveyard/Cemetery Percent
Graveyard/Cemetery Smoothed (Getis-Ord Gi)
Identifying Rising Words
To find newly emerging words, we first measured the
degree to which the usage of each word in the corpus
had been rising over the 13 month period.
To identify these rising words we extracted the 67,000
words that occur at least 1,000 times in the corpus and
compared word relative frequency per day to day of the
year using a Spearman’s rank correlation coefficient.
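The rising-word measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the input layout (one count and one corpus total per day) are assumptions, and tied values are given average ranks, which is the usual convention for Spearman's ρ.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def rising_score(daily_counts, daily_totals):
    """Correlate a word's relative frequency per day with the day index.

    daily_counts[i] is the word's count on day i (hypothetical input layout);
    daily_totals[i] is the total number of words in the corpus on day i.
    """
    relative = [count / total for count, total in zip(daily_counts, daily_totals)]
    days = list(range(len(relative)))
    return spearman_rho(days, relative)
```

A word whose relative frequency increases every day scores ρ = 1, a steadily declining word scores ρ = -1, and a word with no trend scores close to 0.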
[Time charts for example words: ρ = .044, ρ = .116, ρ = -.028]
The Top 10 Rising Words on Twitter 2014
Word          ρ       Definition
fuckboy       0.947   Asshole, Jerk, Poser, Tool, etc.
rn            0.938   Right Now (Top Riser 2013)
hbd           0.928   Happy Birthday
fw            0.927   Fuck with
unbothered    0.926   Unconcerned & Disengaged
ft            0.925   Face time
gmfu          0.924   Get me fucked up
sm            0.919   So Much
squad         0.919   Squad
asf           0.918   As fuck
Identifying Emerging Words
Although measuring correlations allows for rising words
to be identified, most are far too common by 2014 to
show patterns of regional spread.
To identify emerging words we cross-referenced the list
of rising words against a list of rare words, defined as
words with low overall frequencies in the fourth quarter
of 2013 (excluding proper nouns).
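The cross-referencing step can be sketched as a simple filter. The thresholds below (`min_rho`, `max_rate`) and the baseline rates in the example are invented for illustration, since the talk does not state the cut-offs used; the exclusion of proper nouns is not implemented here.

```python
def emerging_words(rho_by_word, baseline_rate_per_million, min_rho=0.8, max_rate=1.0):
    """Keep words that rise strongly and were rare in the baseline period.

    rho_by_word: {word: Spearman's rho against day of the year}
    baseline_rate_per_million: {word: rate per million words in Q4 2013}
    min_rho and max_rate are hypothetical thresholds, not the authors' values.
    """
    selected = [
        (word, rho)
        for word, rho in rho_by_word.items()
        if rho >= min_rho and baseline_rate_per_million.get(word, 0.0) <= max_rate
    ]
    # Strongest risers first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# The rho values come from the rising-word list; the baseline rates are made up.
rho = {"rn": 0.938, "unbothered": 0.926, "gmfu": 0.924}
baseline = {"rn": 250.0, "unbothered": 0.4, "gmfu": 0.2}
result = emerging_words(rho, baseline)
```

With these made-up baseline rates, `rn` is dropped because it was already common at the end of 2013, while `unbothered` and `gmfu` are kept.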
Top 10 Emerging Words on Twitter 2014
Word          ρ       Definition
unbothered    0.926   Unconcerned & Disengaged
gmfu          0.924   Get Me Fucked Up
joggers       0.908   Jogging pants
fuckboys      0.902   Losers, wimps, posers, etc.
rekt          0.900   Wrecked
tfw           0.879   That feel when
xans          0.878   Benzodiazepine pills
baeless       0.875   To be without a bae
boolin        0.857   Hanging out, esp. young men
lordt         0.854   Lord, as exclamation
Top 11-20 Emerging Words on Twitter 2014
Word          ρ       Definition
celfie        0.852   selfie
slays         0.843   impresses, succeeds at, etc.
famo          0.840   family and friends
fuckboi       0.838   fuckboy
(on) fleek    0.838   on point, esp. eyebrows
faved         0.836   to favorite something
gainz         0.828   earnings
bruuh         0.817   bro
amirite       0.816   am I right
notifs        0.808   notifications, especially online
http://www.google.co.uk/trends/explore#q=unbothered
S-shaped Curves
In the time charts for many of the rising and emerging
words we see clear s-curves or what look like the start
of s-curves.
S-shaped Curves
Similar results have also been found repeatedly in
sociolinguistic apparent time studies (see Labov, 2001),
as well as in corpus-based research in historical
linguistics (e.g. Nevalainen & Raumolin-Brunberg, 2003).
Similar results have also been obtained in research on
the diffusion of innovations (see Rogers, 2003), where it
is referred to as an S-shaped Curve of Diffusion.
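One common way to describe such an s-shaped curve is the logistic function. The talk does not say that a logistic model was fitted, so the sketch below is only an illustration of the shape, with hypothetical parameter names.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Logistic s-curve: slow start, rapid rise around the midpoint, then levelling off.

    ceiling  - the value the curve approaches as t grows
    rate     - how steep the rise is
    midpoint - the time at which usage reaches half of the ceiling
    """
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# Usage over eleven evenly spaced time points around the midpoint.
curve = [logistic(t) for t in range(-5, 6)]
```

Growth between successive time points is largest around the midpoint and smallest at the two ends, which is what gives the curve its s shape.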
https://www.uni-due.de/SHE/S-Curve.JPG
Summary: Time Patterns
New words rise (and fall) very quickly in Modern
English, with numerous new words entering the
language and quickly rising in usage every year.
The usage of emerging words over time tends to follow
an s-shaped curve, echoing results found in
sociolinguistic apparent time studies and diffusion of
innovation research.
Mapping the Spread of New Words
An important technical problem is how to map the
spread of a new word across a region.
One approach is to map the relative frequency (e.g.
occurrences per million words) of the word across a
series of regional corpora (e.g. all the tweets from a
particular county) over a series of time points.
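The relative-frequency approach can be sketched as below. The record layout `(region, period, text)` and the whitespace tokenisation are assumptions made for the example; a real pipeline would need a tokeniser suited to tweets.

```python
from collections import defaultdict


def relative_frequency_by_region(records, target_word):
    """Return {(region, period): occurrences of target_word per million words}.

    records is an iterable of (region, period, text) tuples, for example a
    county code, a month label, and the text of one tweet (hypothetical layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for region, period, text in records:
        tokens = text.lower().split()  # naive tokenisation, for illustration only
        key = (region, period)
        totals[key] += len(tokens)
        hits[key] += sum(1 for token in tokens if token == target_word)
    return {
        key: 1_000_000 * hits[key] / totals[key]
        for key in totals
        if totals[key] > 0
    }


# Made-up records for two regions in one month.
records = [
    ("county_a", "2014-01", "so unbothered today"),
    ("county_a", "2014-01", "nothing here at all ok"),
    ("county_b", "2014-01", "hello world"),
]
rates = relative_frequency_by_region(records, "unbothered")
```

Each `(region, period)` value can then be plotted as one cell of one map in the time series.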
Geographical Diffusion of Linguistic Forms
Two major theories have been proposed to explain how
new linguistic forms generally spread in language:
The Wave Model states that new forms spread out
radially from their source.
The Gravity Model states that new forms spread out
from one urban area to the next, based on distance
and population size, only later filling in less
populated areas in between.
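The difference between the two models can be illustrated with a small sketch. The scoring functions are simplified textbook-style forms chosen for the example (inverse distance for the wave model, population product over squared distance for the gravity model), not formulas given in the talk, and the populations and distances are invented.

```python
def wave_score(distance_km):
    """Wave model: influence depends only on distance from the source."""
    return 1.0 / distance_km


def gravity_score(pop_source, pop_target, distance_km):
    """Gravity model: influence grows with both populations and falls with distance."""
    return pop_source * pop_target / distance_km ** 2


# Hypothetical source city of 500,000 people and two possible targets.
source_pop = 500_000
small_town = {"pop": 10_000, "distance_km": 50}
large_city = {"pop": 200_000, "distance_km": 150}

wave_town = wave_score(small_town["distance_km"])
wave_city = wave_score(large_city["distance_km"])
gravity_town = gravity_score(source_pop, small_town["pop"], small_town["distance_km"])
gravity_city = gravity_score(source_pop, large_city["pop"], large_city["distance_km"])
```

Under the wave score the nearby small town is reached first, whereas under the gravity score the more distant large city outranks it, which is the "city to city first, gaps filled in later" pattern.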
Assessing the Wave and Gravity Models
We can begin to assess the validity of the wave and
gravity models for lexical spread by comparing their
predictions with the spread of unbothered.
This analysis can be facilitated by focusing on one state
where the form eventually becomes relatively common,
for example Georgia.
[Map: Population Density of Georgia, with Atlanta, Columbus, Macon, Augusta and Savannah labelled]
[Maps of Georgia for the first day of each month, 01 November 2013 to 01 November 2014, with the same five cities labelled]
Assessing the Wave and Gravity Models
The geographical spread of unbothered in Georgia
appears to be more complex than predicted by the
Wave or Gravity Model, although both appear to offer a
partial explanation for this pattern of spread.
The percentage of African Americans, however, also
appears to be an important predictor.
[Map: African Americans in Georgia, with Atlanta, Columbus, Macon, Augusta and Savannah labelled]
[Map of Georgia for 01 November 2014, with the same five cities labelled]
Mapping the Spread of New Words on One Map
Presenting a time series of maps is an effective way to
map lexical spread, but another technical issue is how
to map emerging words on one map:
Relative frequency
Date of first (or second...) occurrence
Number of words until first (or second...) occurrence
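Two of the single-map measures, the date of first occurrence and the number of words until first occurrence, can be sketched together. The record layout `(region, day, text)` and the whitespace tokenisation are assumptions for the example, not details from the talk.

```python
from datetime import date


def first_occurrence_by_region(records, target_word):
    """Return {region: (date of first occurrence, words seen before it)}.

    records is an iterable of (region, day, text) tuples (hypothetical layout).
    Regions where the word never occurs are left out of the result.
    """
    words_seen = {}
    first = {}
    for region, day, text in sorted(records, key=lambda record: record[1]):
        if region in first:
            continue  # first occurrence already found for this region
        tokens = text.lower().split()  # naive tokenisation, for illustration only
        seen = words_seen.get(region, 0)
        if target_word in tokens:
            first[region] = (day, seen + tokens.index(target_word))
        else:
            words_seen[region] = seen + len(tokens)
    return first


# Made-up records, deliberately out of date order.
records = [
    ("county_a", date(2014, 1, 2), "so unbothered"),
    ("county_a", date(2014, 1, 1), "hello there friend"),
    ("county_b", date(2014, 1, 3), "nothing to see"),
]
first_seen = first_occurrence_by_region(records, "unbothered")
```

Either value in the tuple can be mapped per county on a single map; the word-count version partly corrects for the fact that densely populated counties produce more text and so tend to show any word earlier.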
Summary: Regional Patterns
New words originate from across the US, including the
Southeast (e.g. Unbothered, Baeless, Boolin), the North
(e.g. Fuckboy, Gainz), and the West (e.g. Wrekt), and
tend to spread within these regions first.
Otherwise, the spread of new words appears to be highly
complex, affected by numerous factors, including
proximity, population density, and demographic patterns.
Traditional Approaches to Historical Linguistics
The empirical analysis of language change is generally
based on historical corpora, which tend to span
centuries, or collections of linguistic interviews, which
tend to span generations (i.e. based on apparent time).
Both sources of data tend to provide a broad temporal
scope but limited temporal resolution and amounts of
data (<1 million words).
The Uniformitarian Principle
“Knowledge of processes that operated in the past can
be inferred by observing ongoing processes in the
present” (Christy, 1983: ix).
This Uniformitarian Principle is cited in Labov (2001) to
justify the use of apparent time interview data in place of
historical corpora, but it also justifies the use of
extremely large and dense contemporary corpora in
place of both of these more common approaches.
A Modern Approach to Historical Linguistics
Working with modern language data mined from online
sources allows unprecedentedly large, rich and
dense natural language corpora to be compiled.
Although historical scope is lost, this approach allows for
language change to be analysed in far greater detail
than would otherwise be possible.
Tracking the Emergence of New Words across Time and Space
Jack Grieve, Centre for Forensic Linguistics, Aston University
Email: [email protected]
https://sites.google.com/site/jackgrieveaston
Twitter: @JWGrieve