Lexicon: exploring language trends on Facebook Walls Roddy Lindsay Data Team

Post on 17-Jan-2016

TRANSCRIPT

Page 1: Lexicon: exploring language trends on Facebook Walls

Lexicon: exploring language trends on Facebook Walls

Roddy Lindsay, Data Team

Page 2: Lexicon: exploring language trends on Facebook Walls

What’s a Wall?

Page 3: Lexicon: exploring language trends on Facebook Walls

Walls are semi-public and public forums on profiles, groups, events, etc.

Old | New

Page 4: Lexicon: exploring language trends on Facebook Walls

Numbers

▪ Blogs

▪ 1.6 million posts per day (Technorati)

▪ ~18 posts per second

▪ Walls

▪ 12-20 million wall posts per day

▪ ~180 posts per second

▪ 5-9 million unique users per day

▪ 2-2.5 GB of unstructured text per day
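The per-second figures on this slide follow from dividing the daily counts by the number of seconds in a day. A minimal sketch of that arithmetic is below; the bytes-per-post range is not on the slide and is only what the quoted totals imply, assuming 1 GB = 10^9 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


blog_rate = per_second(1.6e6)        # Technorati figure quoted on the slide
wall_rate_low = per_second(12e6)     # low end of the wall post range
wall_rate_high = per_second(20e6)    # high end of the wall post range

# Implied average post size (assumption: 1 GB = 1e9 bytes).
bytes_per_post_low = 2.0e9 / 20e6    # least text spread over most posts
bytes_per_post_high = 2.5e9 / 12e6   # most text spread over fewest posts

print(f"blogs: ~{blog_rate:.1f} posts/s")
print(f"walls: ~{wall_rate_low:.0f}-{wall_rate_high:.0f} posts/s")
print(f"wall post size: ~{bytes_per_post_low:.0f}-{bytes_per_post_high:.0f} bytes")
```

The blog figure works out to about 18.5 posts per second, and the wall range to roughly 139-231 posts per second, so the slide's ~180 sits near the middle of that range.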

Page 5: Lexicon: exploring language trends on Facebook Walls

Lexicon 101

Page 6: Lexicon: exploring language trends on Facebook Walls

Brief History of Lexicon

▪ First iteration: “Pulse” (2006)

▪ Interests in profile fields ranked by count

▪ E.g. “Top movies in San Francisco Network”

▪ Pros

▪ Structure through comma delimitation

▪ Cons

▪ Profile information is static (not updated frequently)

▪ Limited to profile field categories (movies, books, interests, TV shows, music)

Page 7: Lexicon: exploring language trends on Facebook Walls

Brief History of Lexicon

▪ Attempt 2:

▪ Extract terms from public and semi-public conversations between friends (on the Wall)

▪ Anonymize user data to respect privacy

▪ Plot time series data to show usage trends

▪ Pros

▪ Wall conversations are closer to real-life (RL) conversations

▪ Topics are constantly changing, giving a strong temporal signal

▪ Cons

▪ No structure

▪ Greater computational requirements

Page 8: Lexicon: exploring language trends on Facebook Walls

How does Lexicon work?

▪ Count occurrences of each word and bigram that is posted each day

▪ Aggregate by unique user to minimize the effect of spam

▪ Trim the long tail to handle data explosion

▪ Normalize for intraweek and seasonal variance by putting total posts in the denominator

▪ Interactive Flash charts, built in-house (used internally and externally for all Facebook reporting products)

“apple” “apple”
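The counting steps on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: the function names, the `(day, user_id, text)` record shape, the simple regex tokenizer, and the `min_users` cutoff used to trim the long tail are all hypothetical.

```python
import re
from collections import defaultdict


def terms(text):
    """Yield lowercase words and adjacent-word bigrams from one post."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    yield from words
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"


def daily_term_share(posts, min_users=2):
    """posts: iterable of (day, user_id, text) tuples.

    Returns {(term, day): unique_users / total_posts_that_day},
    dropping terms used by fewer than min_users users on that day.
    """
    users = defaultdict(set)          # (term, day) -> set of user ids
    posts_per_day = defaultdict(int)  # day -> total number of posts
    for day, user_id, text in posts:
        posts_per_day[day] += 1
        for term in terms(text):
            users[(term, day)].add(user_id)
    return {
        key: len(ids) / posts_per_day[key[1]]
        for key, ids in users.items()
        if len(ids) >= min_users      # trim the long tail
    }


posts = [
    ("day1", 1, "apple apple apple"),      # one user repeating a term
    ("day1", 2, "I love my apple laptop"),
    ("day1", 3, "happy birthday"),
    ("day2", 1, "apple pie"),
]
shares = daily_term_share(posts)
print(shares)
```

In this toy input only “apple” on day1 survives the cutoff, with a share of 2 users over 3 posts. User 1 repeating the word three times still counts once, which is the point of aggregating by unique user.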

Page 9: Lexicon: exploring language trends on Facebook Walls

How does Lexicon work?

▪ More technically...

▪ Use Scribe (distributed log file aggregation service built with Thrift) to collect wall post logs from web servers

▪ Have a 180-node Hadoop cluster that loads the log files into Hive, our homegrown data warehouse sitting on top of Hadoop

▪ Pipeline of Map-Reduce scripts (written in Python) that count the number of unique users for each (term, day) pair and trim the long tail

▪ Load into horizontally partitioned MySQL tier for user queries

▪ PHP front-end

▪ Memcached sits in front to cache common queries

▪ All of these are (or will be) open-source projects

▪ Facebook is an active contributor to most of these projects
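To make the Map-Reduce step concrete, here is a single-process sketch of how a mapper and reducer for unique users per (term, day) might look. It only simulates the map, shuffle, and reduce phases in memory and does not use Hadoop or Hive. The record shape, the `min_users` cutoff, and the `shard_for` routing function are assumptions; the slides do not say how the MySQL tier is partitioned.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """Emit ((term, day), user_id) for every distinct word in one log record."""
    day, user_id, text = record
    for term in set(text.lower().split()):
        yield (term, day), user_id


def reducer(key, user_ids):
    """Collapse all user ids seen for one (term, day) key into a unique count."""
    return key, len(set(user_ids))


def run_job(records, min_users=2):
    """Simulate map, shuffle/sort, and reduce in a single process."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))               # stands in for the shuffle
    counts = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        key, n_users = reducer(key, (user for _, user in group))
        if n_users >= min_users:                 # trim the long tail
            counts[key] = n_users
    return counts


def shard_for(term, n_shards=4):
    """Hypothetical routing: pick a partition from a stable hash of the term."""
    return zlib.crc32(term.encode("utf-8")) % n_shards


records = [
    ("day1", 1, "juno was great"),
    ("day1", 2, "saw juno"),
    ("day1", 2, "juno juno"),
    ("day2", 3, "juno"),
]
counts = run_job(records)
print(counts, shard_for("juno"))
```

Here only (“juno”, day1) survives, with two unique users. Routing by a stable hash of the term would keep all days for one term on the same partition, so a query for a single term touches one shard.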

Page 10: Lexicon: exploring language trends on Facebook Walls

Demo

Page 11: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

Page 12: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Tracking news

▪ Lexicon shows relative chatter surrounding current events

▪ Can understand which events are of interest to the Facebook audience

“tibet” “died” (Heath Ledger)

Page 13: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Natural language trends

▪ Words and phrases constantly enter and exit the lexicon

▪ Track the popularity of terms that are used in everyday conversation

“lulz” “pwned”

Page 14: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Understanding the Facebook audience

▪ Lexicon trends can yield insights into Facebook demographics, user attitudes towards Facebook products, and how the products are used

“the add”

Page 15: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Brand Mindshare

▪ Brands and commercial products are mentioned in Wall conversations, just as in face-to-face conversations

“verizon” “juno”

Page 16: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Categories that are social in nature yield the strongest signal

▪ Entertainment, Mobile, Automotive, QSR, etc.

“honda”, “toyota”

Page 17: Lexicon: exploring language trends on Facebook Walls

What is Lexicon useful for?

▪ Measuring the success of sponsored gift campaigns on Facebook

▪ Sponsored gifts: images you can send to friends along with a Wall post

“coors”

Page 18: Lexicon: exploring language trends on Facebook Walls

Challenges

Page 19: Lexicon: exploring language trends on Facebook Walls

Challenges

▪ Term disambiguation

▪ Words are used in a variety of contexts

▪ E.g. my cousin Wendy’s birthday vs. Wendy’s hamburgers

▪ Tracking each different context automatically with machine learning techniques is difficult

OR ?

▪ Language classifiers, proper tokenization, and smart cleaning of the data can get us part way there

Page 20: Lexicon: exploring language trends on Facebook Walls

Challenges

▪ Sentiment

▪ Is the mention of a term positive, negative, neutral, something else?

▪ Most challenging aspects: irony, ambiguous sentiment terms, complex grammar

▪ Many top companies use humans to rate a sizable percentage of posts

▪ Numerous Ph.D. candidates have quit graduate school over this problem

▪ Obviously a difficult task...

Page 21: Lexicon: exploring language trends on Facebook Walls

Challenges

▪ Sentiment

▪ The language on Facebook wall posts is characterized by:

▪ slang, lulz

▪ mispellings

▪ blunt sentences.

▪ superfluous punctuation!!!

▪ absent punctuation for example

▪ emoticons ^_^

▪ acronyms, omg

▪ a big freaking mess
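A tokenizer for text like this has to keep emoticons intact and fold noisy variants together. The sketch below is one possible approach, not the one used in Lexicon: the regular expression, the small emoticon set, and the rule of collapsing any character repeated three or more times are all assumptions chosen for illustration.

```python
import re

TOKEN = re.compile(
    r"""
    [:;=]-?[()dp]            # western emoticons such as :) ;-) :p
  | \^_+\^                   # the ^_^ style
  | [a-z0-9]+(?:'[a-z]+)?    # words, keeping simple contractions
  | [!?.]+                   # a run of punctuation as a single token
    """,
    re.VERBOSE,
)


def squeeze(token):
    """Collapse a character repeated 3+ times to one occurrence."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def tokenize(post):
    """Lowercase the post, split it into tokens, and squeeze each token."""
    return [squeeze(t) for t in TOKEN.findall(post.lower())]


tokens = tokenize("omg hotttt!!! ^_^")
print(tokens)  # ['omg', 'hot', '!', '^_^']
```

Collapsing only runs of three or more leaves ordinary double letters alone, so “good” stays “good”, but it will still miss variants such as “hott”.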

Page 22: Lexicon: exploring language trends on Facebook Walls

Challenges

▪ Sentiment

▪ Blunt language without complex grammar means that irony and sarcasm aren’t big issues

▪ Synonym identification (figuring out that “hotttt” == “hot”), subjective/objective classification, and tokenization are more troublesome

▪ Something to keep in mind: strong prior probability of a subjective post being positive (80-90% as rated by humans)

▪ Walls are not blogs or movie reviews

▪ Theory: users don’t want to appear to be negative, and so avoid making overtly negative comments for the most part

▪ Sentiment classifier that guesses positive every time gives the least error

▪ Maybe sentiment isn’t as important for us...
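The point about the always-positive classifier is simple arithmetic: if 80-90% of subjective posts are positive, guessing positive every time is wrong only 10-20% of the time, and a trained classifier has to beat that. A minimal sketch with a made-up sample of 100 labels at an 85% positive rate (the labels are hypothetical, not Facebook data):

```python
def error_rate(predictions, labels):
    """Fraction of predictions that disagree with the human labels."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


# Hypothetical human-rated sample: 85 positive posts, 15 negative.
labels = ["pos"] * 85 + ["neg"] * 15

# Baseline classifier that guesses positive for every post.
always_positive = ["pos"] * len(labels)

baseline_error = error_rate(always_positive, labels)
print(baseline_error)  # 0.15
```

Any real classifier evaluated on such data needs an error rate below this baseline before it adds information.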

Page 23: Lexicon: exploring language trends on Facebook Walls

Future trends for text analytics

▪ Data visualization

▪ Graph structure/Diffusion analysis

▪ Cloud computing

Page 24: Lexicon: exploring language trends on Facebook Walls

Thanks!