tracking the emergence of new words across time and space

Post on 16-Jul-2015


Tracking the Emergence of New Words across Time and Space

Jack Grieve, Aston University

Research conducted with Diansheng Guo & Alice Kasakoff, University of South Carolina, and Andrea Nini, Aston University

Funded as part of the Digging into Data Challenge

Approaches to Historical Linguistics

There are several different approaches to the analysis of

language change:

Reconstruction through comparison of known languages (comparative method)

Analysis of previous linguistic research (e.g. lexicographic research)

Analysis of historical texts (corpus-based)

Apparent time studies with interview data (sociolinguistics)

Computer simulations

Lexical Change

Research in historical linguistics and etymology has

analysed how the usage of certain words has changed

over relatively long periods of time (primarily based on

historical corpora and lexicographic research), but overall

there are large gaps in our knowledge of lexical change,

including how newly emerging words enter a language

and spread across its speakers.

Words are Rare Events

The main problem with studying lexical variation and

change is that most words are incredibly rare, thus

requiring incredibly large corpora of natural language.

This is why most research on lexical variation and

change has focused on relatively high frequency words,

primarily function words (e.g. pronouns, prepositions,

auxiliary verbs).

Word Frequency Distribution (Zipf 1935, 1945)

The majority of the 67,000 most frequent words in our corpus occur less than once per 25 million words

New Words are Incredibly Rare Events

The analysis of new words requires even more data,

because emerging words are by definition especially

rare.

In addition, to analyse the temporal and spatial spread

of new words, large corpora must be compiled for a

large number of time points and locations.

Big Data

Suitable data has recently become available with the

rise of social media and smartphones, which

provide massive amounts of time-stamped and geo-

coded natural language data.

Goals of Today’s Talk

Identify emerging words from 2014 based on a multi-

billion word corpus of American tweets.

Chart their usage over time and identify common

temporal patterns of lexical spread.

Map their geographical diffusion and identify common

spatial patterns of lexical spread.

The Corpus

Since 2013, the team at USC have been compiling two

multi-billion word geocoded corpora for the US and the UK

using the Twitter API.

Twitter is a particularly rich source of geocoded data and

is also very popular, informal, and youthful, making it ideal

for tracking the emergence of new words.

Approximately 2% of tweets are geocoded.

The Corpus

The analysis today is based on an 8.9 billion word

corpus of American Tweets from October 2013 to

November 2014, which totals approximately 980 million

Tweets from 7 million users.

Every tweet is geocoded with the precise longitude and

latitude of the user when posting, which were then used

to identify the county where each Tweet was produced.

-87.684555,42.074043 Just posted a photo @ Baha'i House of Worship

Corpus Examples

username,fips,time,tweet
-,48439,Sun Jul 27 23:59:59 EDT 2014,don't follow the right ppl lol
-,42007,Sun Jul 27 23:59:59 EDT 2014,yesss moody judy
-,36005,Sun Jul 27 23:59:59 EDT 2014,Man i was just thinking shexx be lurking but won't hmu
-,25021,Sun Jul 27 23:59:59 EDT 2014,no seeing u on tv is reel but not seeing u on twitter is real for me...so pls visit us here everyday.
-,26163,Sun Jul 27 23:59:59 EDT 2014,Hate seeing my friends sad
-,12093,Sun Jul 27 23:59:59 EDT 2014,this is the shirt i won that i got to sign btw!!:)

[Maps: Graveyard/Cemetery (two maps); Graveyard/Cemetery Percent; Graveyard/Cemetery Smoothed (Getis-Ord Gi)]

Identifying Rising Words

To find newly emerging words, we first measured the

degree to which the usage of each word in the corpus

had been rising over the 13 month period.

To identify these rising words, we extracted the 67,000

words that occur at least 1,000 times in the corpus and

correlated each word’s relative frequency per day with the

day of the year using Spearman’s rank correlation coefficient.
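The correlation step described above can be sketched in a few lines. This is only a minimal illustration, not the pipeline used for the talk: the function names and the four days of counts are invented, and a real analysis would run over every day in the corpus for each of the 67,000 words.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data for one word: (occurrences, total words) on four consecutive days.
daily_counts = [(2, 1_000_000), (3, 1_200_000), (5, 1_100_000), (9, 1_000_000)]
rel_freq = [hits / total for hits, total in daily_counts]
days = list(range(len(rel_freq)))
rho = spearman_rho(days, rel_freq)  # relative frequency rises every day, so rho is 1.0
```

Because the coefficient is rank-based, it rewards words whose usage rises steadily, whatever the shape of the rise.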

[Figure: example words with ρ = .116, ρ = .044, and ρ = -.028]

The Top 10 Rising Words on Twitter 2014

Word        ρ      Definition
fuckboy     0.947  Asshole, Jerk, Poser, Tool, etc.
rn          0.938  Right Now (Top Riser 2013)
hbd         0.928  Happy Birthday
fw          0.927  Fuck with
unbothered  0.926  Unconcerned & Disengaged
ft          0.925  Face time
gmfu        0.924  Get me fucked up
sm          0.919  So Much
squad       0.919  Squad
asf         0.918  As fuck

Identifying Emerging Words

Although measuring correlations allows for rising words

to be identified, most are far too common by 2014 to

show patterns of regional spread.

To identify emerging words we cross-referenced the list

of rising words against a list of rare words, defined as

words with low overall frequencies in the fourth quarter

of 2013 (excluding proper nouns).
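The cross-referencing step can be sketched as a simple filter. The talk does not give the cut-offs used, so `min_rho` and `max_early_rate` below are assumed values, and the function name, the baseline rates, and the `placename` entry are invented for illustration. Only the three ρ values for rn, unbothered, and joggers come from the tables in this talk.

```python
def emerging_words(rho_by_word, q4_2013_per_million, proper_nouns,
                   min_rho=0.8, max_early_rate=1.0):
    """Return rising words that were rare in the baseline quarter, highest rho first.

    min_rho and max_early_rate are assumed thresholds, not values from the talk.
    """
    selected = [
        (word, rho) for word, rho in rho_by_word.items()
        if rho >= min_rho
        and q4_2013_per_million.get(word, 0.0) <= max_early_rate
        and word not in proper_nouns
    ]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# rho values for rn, unbothered and joggers are from the talk; everything else is invented.
rho_by_word = {"rn": 0.938, "unbothered": 0.926, "joggers": 0.908, "placename": 0.9}
q4_2013_per_million = {"rn": 150.0, "unbothered": 0.2, "joggers": 0.5, "placename": 0.1}
proper_nouns = {"placename"}

result = emerging_words(rho_by_word, q4_2013_per_million, proper_nouns)
# rn is dropped as already common, placename as a proper noun.
```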

Top 10 Emerging Words on Twitter 2014

Word        ρ      Definition
unbothered  0.926  Unconcerned & Disengaged
gmfu        0.924  Get Me Fucked Up
joggers     0.908  Jogging pants
fuckboys    0.902  Losers, wimps, posers, etc.
rekt        0.900  Wrecked
tfw         0.879  That feel when
xans        0.878  Benzodiazepine pills
baeless     0.875  To be without a bae
boolin      0.857  Hanging out, esp. young men
lordt       0.854  Lord, as exclamation

Top 11-20 Emerging Words on Twitter 2014

Word        ρ      Definition
celfie      0.852  selfie
slays       0.843  impresses, succeeds at, etc.
famo        0.840  family and friends
fuckboi     0.838  fuckboy
(on) fleek  0.838  on point, esp. eyebrows
faved       0.836  to favorite something
gainz       0.828  earnings
bruuh       0.817  bro
amirite     0.816  am I right
notifs      0.808  notifications, especially online

S-shaped Curves

In the time charts for many of the rising and emerging

words we see clear s-curves or what look like the start

of s-curves.

S-shaped Curves

Similar results have also been found repeatedly in

sociolinguistic apparent time studies (see Labov, 2001),

as well as in corpus-based research in historical

linguistics (e.g. Nevalainen & Raumolin-Brunberg, 2003).

Similar results have also been obtained in research on

the diffusion of innovations (see Rogers, 2003), where it

is referred to as an S-shaped Curve of Diffusion.
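One common way to describe an S-shaped curve of this kind is the logistic function. The talk does not say which function, if any, was fitted to the time charts, so the sketch below is only an illustration under that assumption. The parameter names and values are invented.

```python
import math

def logistic(t, ceiling, rate, midpoint):
    """Logistic curve: slow start, fastest rise at `midpoint`, then levelling off."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Hypothetical word that levels off at 20 uses per million words,
# with its fastest growth on day 200 of the observation window.
curve = [logistic(day, ceiling=20.0, rate=0.03, midpoint=200)
         for day in range(0, 401, 50)]

# Growth per 50-day interval: small at first, largest around the midpoint, small again.
growth = [later - earlier for earlier, later in zip(curve, curve[1:])]
```

A word observed only during the first half of such a curve would show what the charts describe as "the start of an s-curve".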

Summary: Time Patterns

New words rise (and fall) very quickly in Modern

English, with numerous new words entering the

language and quickly rising in usage every year.

The usage of emerging words over time tends to follow

an s-shaped curve, echoing results found in

sociolinguistic apparent time studies and diffusion of

innovation research.

Mapping the Spread of New Words

An important technical problem is how to map the

spread of a new word across a region.

One approach is to map the relative frequency (e.g.

occurrences per million words) of the word across a

series of regional corpora (e.g. all the tweets from a

particular county) over a series of time points.
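As a rough sketch of this approach, the relative frequency of a word can be computed for each county and time point as follows. The tuple layout, the monthly time points, the county codes, and the whitespace tokenisation are all assumptions made for the example, not details from the talk.

```python
from collections import defaultdict

def relative_frequency_per_million(tweets, target):
    """Map (county FIPS, month) to occurrences of `target` per million words."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for fips, month, text in tweets:
        tokens = text.lower().split()  # naive whitespace tokenisation (an assumption)
        key = (fips, month)
        totals[key] += len(tokens)
        hits[key] += tokens.count(target)
    return {key: 1_000_000 * hits[key] / totals[key]
            for key in totals if totals[key]}

# Hypothetical miniature corpus: (county FIPS code, month, tweet text).
tweets = [
    ("13121", "2014-06", "so unbothered today"),
    ("13121", "2014-06", "hate seeing my friends sad"),
    ("13051", "2014-06", "just posted a photo"),
]
rates = relative_frequency_per_million(tweets, "unbothered")
# One hit in eight words for the first county gives 125,000 per million;
# rates this extreme only arise because the toy corpus is tiny.
```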

Geographical Diffusion of Linguistic Forms

Two major theories have been proposed to explain how

new linguistic forms generally spread in language:

The Wave Model states that new forms spread out

radially from their source.

The Gravity Model states that new forms spread out

from one urban area to the next, based on distance

and population size, only later filling in less

populated areas in between.
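The difference between the two models can be made concrete with a toy comparison. The sketch below uses a simplified gravity score (product of populations divided by squared distance), which is one common formulation and is assumed here rather than taken from the talk. The places, populations, and distances are invented.

```python
def wave_order(source, places):
    """Wave model: places are reached in order of distance from the source."""
    return sorted((p for p in places if p != source),
                  key=lambda p: places[p]["distance"])

def gravity_order(source, places):
    """Gravity model: places are reached in order of population-weighted attraction."""
    def attraction(p):
        # Simplified gravity score (an assumption): P_source * P_p / distance^2.
        return (places[source]["population"] * places[p]["population"]
                / places[p]["distance"] ** 2)
    return sorted((p for p in places if p != source), key=attraction, reverse=True)

# Hypothetical places: population and distance (km) from the source city.
places = {
    "source city": {"population": 500_000, "distance": 0},
    "nearby village": {"population": 2_000, "distance": 20},
    "distant city": {"population": 200_000, "distance": 150},
}

wave = wave_order("source city", places)        # nearby village first
gravity = gravity_order("source city", places)  # distant city first
```

Under the wave model the nearby village adopts the form first. Under the gravity model the distant city does, with the village filled in later.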

Assessing the Wave and Gravity Models

We can begin to assess the validity of the wave and

gravity models for lexical spread by comparing their

predictions against the spread of unbothered.

This analysis can be facilitated by focusing on one state

where the form eventually becomes relatively common,

for example Georgia.

[Map: Population Density of Georgia, with Atlanta, Columbus, Macon, Augusta, and Savannah labelled]

[Map series: Georgia at monthly intervals from 01 November 2013 to 01 November 2014, with the same five cities labelled]

Assessing the Wave and Gravity Models

The geographical spread of unbothered in Georgia

appears to be more complex than predicted by the

Wave or Gravity Model, although both appear to offer a

partial explanation for this pattern of spread.

The percentage of African Americans, however, also

appears to be an important predictor.

African Americans in Georgia

[Maps of Georgia dated 01 November 2014, with Atlanta, Columbus, Macon, Augusta, and Savannah labelled]

Mapping the Spread of New Words on One Map

Presenting a time series of maps is an effective way to

map lexical spread, but another technical issue is how

to map emerging words on one map:

Relative frequency

Date of first (or second...) occurrence

Number of words until first (or second...) occurrence
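The second and third options in the list above can both be derived in a single pass over date-sorted tweets. The sketch below is one possible reading of those measures, with invented data and county codes. In particular, it assumes "number of words until first occurrence" counts every word produced in the county up to and including the first use of the target.

```python
def first_occurrence_by_county(tweets, target):
    """For each county, return (date of first use, words seen up to and including it).

    `tweets` must be sorted by date. Counties where `target` never occurs are omitted.
    """
    words_seen = {}
    first = {}
    for fips, date, text in tweets:
        if fips in first:
            continue  # first occurrence already recorded for this county
        tokens = text.lower().split()  # naive whitespace tokenisation (an assumption)
        if target in tokens:
            position = tokens.index(target) + 1
            first[fips] = (date, words_seen.get(fips, 0) + position)
        else:
            words_seen[fips] = words_seen.get(fips, 0) + len(tokens)
    return first

# Hypothetical date-sorted tweets: (county FIPS code, ISO date, tweet text).
tweets = [
    ("13121", "2013-11-02", "hate seeing my friends sad"),
    ("13051", "2013-11-05", "just posted a photo"),
    ("13121", "2013-12-10", "so unbothered today"),
    ("13051", "2014-03-01", "unbothered"),
]
first = first_occurrence_by_county(tweets, "unbothered")
```

Either value can then be plotted once per county, which gives a single map in place of a time series.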

Summary: Regional Patterns

New words originate from across the US, including the

Southeast (e.g. unbothered, baeless, boolin), the North

(e.g. fuckboy, gainz), and the West (e.g. rekt), and

tend to spread within these regions first.

Otherwise, the spread of new words appears to be highly

complex, affected by numerous factors, including

proximity, population density, and demographic patterns.

Traditional Approaches to Historical Linguistics

The empirical analysis of language change is generally

based on historical corpora, which tend to span

centuries, or collections of linguistic interviews, which

tend to span generations (i.e. based on apparent time).

Both sources of data tend to provide a broad temporal

scope but limited temporal resolution and amounts of

data (<1 million words).

The Uniformitarian Principle

“Knowledge of processes that operated in the past can

be inferred by observing ongoing processes in the

present” (Christy, 1983: ix).

This Uniformitarian Principle is cited in Labov (2001) to

justify the use of apparent time interview data in place of

historical corpora, but it also justifies the use of

extremely large and dense contemporary corpora in

place of both of these more common approaches.

A Modern Approach to Historical Linguistics

Analysing modern language data mined from online

sources allows for unprecedentedly large, rich and

dense natural language corpora to be compiled.

Although historical scope is lost, this approach allows for

language change to be analysed in far greater detail

than would otherwise be possible.

Tracking the Emergence of New Words across Time and Space

Jack Grieve

Centre for Forensic Linguistics

Aston University

Email: j.grieve1@aston.ac.uk

Website: https://sites.google.com/site/jackgrieveaston

Twitter: @JWGrieve
