the linguistics of twitter - pycon 2011 presentation
DESCRIPTION
'The Linguistics of Twitter' presentation from PyCon 2011, which I hope starts a dialogue about what we need to accurately measure the effects of social media.
TRANSCRIPT
![Page 1: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/1.jpg)
American English
Regional Dialects
Changing Speech Patterns
Changing Online Measurement
Michael D. Healy
[email protected]
http://michaeldhealy.com
@MichaelDHealy
![Page 2: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/2.jpg)
Michael D. Healy
• Econometrics
• Linguistics
• Not an Engineer
Measuring and Influencing Online and Offline Behavior
Why am I here?
This Seemed Like an Interesting Problem
![Page 3: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/3.jpg)
Plan of Action
• Background
• Where We Stand
  o Data Collection Interlude
• Historical Context
• Where We May Be Going
• Potential Solutions
  o Sort Of
![Page 4: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/4.jpg)
Introduction: Hawaiian Pidgin Video
![Page 5: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/5.jpg)
Plan of Action
• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions
![Page 6: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/6.jpg)
Background
Regional Differences In Word Choice
MrEverything6's Tweet
Dallas, Texas Region
coke - Coca-Cola or soft drink in general?
Coca-Cola Probably Wants To Know
![Page 7: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/7.jpg)
Background
Regional Differences In Pronunciation
More Than Just Drawl
pin
Is that:
Pin a tail on the donkey.
-OR-
Give me a 'pin' to write with.
![Page 8: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/8.jpg)
Plan of Action
• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions
![Page 9: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/9.jpg)
Where We Stand
![Page 10: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/10.jpg)
Where We Stand
![Page 11: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/11.jpg)
Detailed Dialectal Map
http://aschmann.net/AmEng/
![Page 12: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/12.jpg)
Where We Stand
Wait!
Isn't This All Just Poor English?
They Don't Speak The King's English!
1) America Doesn't Have A King
![Page 13: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/13.jpg)
Where We Stand
Wait!
Isn't This All Just Poor English?
2) English Doesn't Have An Authority Like:
French: L'Académie française
Spanish: Asociación de Academias de la Lengua Española
Numerous Others: http://en.wikipedia.org/wiki/List_of_language_regulators
![Page 14: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/14.jpg)
Where We Stand
Who Is Right?
Everyone
Prescriptive Linguistics: Tell You What Is Right
Descriptive Linguistics: Describe How You Communicate
Trying To Sell More Widgets?
Probably Descriptive Is Best
![Page 15: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/15.jpg)
Where We Stand
Selected American English Dialects:
• New England
• Northern
• North Midland
• South Midland
• NYC
• Western
• AAVE
• Hawaiian Pidgin
![Page 16: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/16.jpg)
Plan of Action
• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions
![Page 17: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/17.jpg)
Historical Context
Linguists Thought TV Would Make Us All Sound The Same
Think Tom Brokaw
Area of 'Standard American English'
Not Overly Large
Not Largely Populated
![Page 18: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/18.jpg)
Historical Context
Been To Wisconsin?
Seen Fargo?
Biggest Change In Spoken English Since 1750
Going On Right Now - After TV
'Oh yeah? Yeah'
![Page 19: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/19.jpg)
Historical Context
Portions Of America Experience Some or All of the Northern Cities Vowel Shift
![Page 20: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/20.jpg)
Historical Context
Sum This Up:
People In The Northern Cities Region Are Producing A Very Different Sounding English From Other Dialects
![Page 21: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/21.jpg)
Historical Context
America Has Been Multi-Lingual Since July 9, 1776
![Page 22: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/22.jpg)
Plan of Action
• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions
![Page 23: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/23.jpg)
Where We May Be Going
![Page 24: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/24.jpg)
Where We May Be Going
~74% of Americans Live In A Megaregion
Megaregions Tied To Existing Dialect Regions
![Page 25: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/25.jpg)
Where We May Be Going
William Labov, PhD
Professor of Linguistics
University of Pennsylvania
http://www.ling.upenn.edu/~wlabov/
Pretty Much The Authority on American English Dialects
'And instead of getting a pepper-and-salt effect, we find very clear and sharp divisions between the dialects of the United States, which are getting more different from each other as time goes on.'
![Page 26: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/26.jpg)
Plan of Action
• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions
![Page 27: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/27.jpg)
Potential Solutions
One American Dialect Is Unique In Geography:
African-American Vernacular English (AAVE)
Not In A Geographically Contiguous Region
![Page 28: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/28.jpg)
Potential Solutions
Center For Applied Linguistics.
"Thats the way baseball go."
![Page 29: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/29.jpg)
Potential Solutions
Correct the Spelling & Grammar

    import enchant
    from nltk.metrics import edit_distance

    class SpellingReplacer(object):
        def __init__(self, dict_name='en', max_dist=2):
            self.spell_dict = enchant.Dict(dict_name)
            # Largest edit distance at which a suggestion is accepted.
            self.max_dist = max_dist

        def replace(self, word):
            # Words the dictionary already knows are left alone.
            if self.spell_dict.check(word):
                return word
            suggestions = self.spell_dict.suggest(word)
            # Accept the top suggestion only if it is close to the input.
            if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
                return suggestions[0]
            else:
                return word

![Page 30: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/30.jpg)
Potential Solutions
Example 1
well im gonna go so i’ll talk to u lata 1
Corrected Example 1
Well mi Donna go so I'll talk to U late
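Why the corrector drifts toward "late" and "Donna" can be seen from edit distance alone. The sketch below is a plain-Python Levenshtein distance, a stand-in for `nltk.metrics.edit_distance` so the example has no dependencies. The candidate list is hypothetical and is not what the enchant dictionary actually returns.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

# Hypothetical candidate words for the tweet token "lata".
candidates = ["late", "later", "lava"]
distances = {w: edit_distance("lata", w) for w in candidates}

for word in candidates:
    print("%s -> %d" % (word, distances[word]))
```

"late" is one edit from "lata" while the intended "later" is two, so distance by itself cannot pick the right word. "gonna" is one edit from "donna" but well beyond a threshold of 2 from "going to", which is why slang expansions need a different approach than spell checking.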
![Page 31: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/31.jpg)
Potential Solutions
Build Out a Dictionary of Words
Regex Match and Replace

    proper_words = {
        'hater': ['enemy', 'jealous individual', 'not friend'],
        'coke': ['coke', 'soda', 'pop'],
    }

Which Region?
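One way to approach the "Which Region?" question is to key each word's senses by region instead of keeping a flat list. This is only a sketch: the region labels, the sense strings, and the `resolve` helper are hypothetical names chosen for illustration, and a real table would have to be built from tagged data.

```python
# Hypothetical region labels and senses, for illustration only.
REGIONAL_SENSES = {
    "coke": {
        "south": "soft drink (generic)",
        "default": "Coca-Cola",
    },
}

def resolve(word, region):
    """Return the sense of `word` for `region`, or the word itself if unknown."""
    senses = REGIONAL_SENSES.get(word.lower())
    if senses is None:
        return word
    # Fall back to the default sense when the region has no specific entry.
    return senses.get(region, senses["default"])

print(resolve("coke", "south"))
print(resolve("coke", "west"))
```

The lookup is trivial. The hard part, as the slide implies, is knowing which region a tweet belongs to before the lookup happens.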
![Page 32: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/32.jpg)
Potential Solutions
Example 2
well i gotta go, i’ll talk to you later aight bye 1
![Page 33: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/33.jpg)
Potential Solutions

    import re

    # (regex, replacement) pairs applied in order.
    replacement_patterns = [
        (r'gotta', 'got to'),
        (r"i\'ll", 'I will'),
        (r'aight', 'all right'),
    ]

    class RegexReplacer(object):
        def __init__(self, patterns=replacement_patterns):
            # Compile each pattern once, up front.
            self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

        def replace(self, text):
            s = text
            for (pattern, repl) in self.patterns:
                (s, count) = re.subn(pattern, repl, s)
            return s

![Page 34: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/34.jpg)
Potential Solutions
Example 2
well i gotta go, i’ll talk to you later aight bye 1
well i got to go, I will talk to you later
All right
Bye
1 (!?)
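A slightly more defensive variant of the match-and-replace idea is sketched below. It assumes three things the slides do not state: word boundaries around each pattern, case-insensitive matching, and acceptance of both the straight and the curly apostrophe that Twitter clients emit. The `normalize` name is hypothetical.

```python
import re

# Word boundaries (\b) keep 'aight' from matching inside 'straight'.
PATTERNS = [
    (r"\bgotta\b", "got to"),
    (r"\bi['’]ll\b", "I will"),   # straight or curly apostrophe
    (r"\baight\b", "all right"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), repl) for p, repl in PATTERNS]

def normalize(text):
    """Apply each replacement in order and return the rewritten text."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text

print(normalize("well i gotta go, i’ll talk to you later aight bye 1"))
```

The trailing "1" passes through untouched, since no surface pattern says what it means.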
![Page 35: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/35.jpg)
Potential Solutions
Example 2
well i got to go, I will talk to you later
All right
Bye
1 (!?)
Here '1' has the concept of: I understand
![Page 36: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/36.jpg)
Potential Solutions
Solution?
Bayesian Prediction Using a Custom Corpus
First Step: Tag Existing Data
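
    import nltk.data

    # Pre-trained Punkt sentence tokenizer for English.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    def tokenize(para):
        # Print the list of sentences Punkt finds in the text.
        print(tokenizer.tokenize(para))
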
![Page 37: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/37.jpg)
Potential Solutions
Solution?
Bayesian Prediction Using a Custom Corpus
Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol
Tokenized as:
'Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol'
So lots of custom work to be done...
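Part of that custom work is segmentation: the stock sentence tokenizer returns the tweet as a single sentence because ".." is not a sentence boundary it was trained on. A rough sketch of a tweet-specific splitter follows. The rules (treat two or more periods as a break, keep hashtags whole) and the function names are assumptions for illustration, not a tested tokenizer.

```python
import re

# Two or more periods, or ordinary end punctuation followed by whitespace.
SEGMENT_BREAK = re.compile(r"\.{2,}|[.!?]+\s+")
# A word, optionally prefixed by '#', with internal apostrophes allowed.
TOKEN = re.compile(r"#?\w+(?:['’]\w+)*")

def segment(tweet):
    """Split a tweet into rough clause-like segments."""
    return [s.strip() for s in SEGMENT_BREAK.split(tweet) if s.strip()]

def tokens(segment_text):
    """Split one segment into word tokens."""
    return TOKEN.findall(segment_text)

tweet = "she called I hit ignored..neva pick up on da first call..playa rule number 23 lol"
for part in segment(tweet):
    print(tokens(part))
```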
![Page 38: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/38.jpg)
Potential Solutions
_andBeautyKills: – after tonight, don’t leave your boy roun’ me,umma #true playa fareal.
Local To SF:
Neecy89: This african boy jus started askin me hella questions idk if he was tryin to be nice or tryna kill me lol
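Regional markers like "hella" are what a Bayesian model over a tagged corpus would pick up on. The sketch below is a minimal naive Bayes classifier with add-one smoothing, written in plain Python so it runs without NLTK. The training tweets and the region labels are hypothetical, hand-written examples and not real data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes(object):
    """Multinomial naive Bayes over whitespace-split words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of tweets
        self.vocab = set()

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text):
        total = float(sum(self.label_counts.values()))
        best_label, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = float(sum(self.word_counts[label].values()) + len(self.vocab))
            for word in text.lower().split():
                if word in self.vocab:   # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labelled training data.
training = [
    ("that burrito was hella good", "sf_bay"),
    ("hella traffic on the bridge", "sf_bay"),
    ("it is wicked cold out", "new_england"),
    ("wicked good chowder", "new_england"),
]
model = NaiveBayes()
model.train(training)
print(model.classify("hella questions"))
```

With only four training tweets this shows the mechanics and nothing more. A usable model needs the tagged corpus the earlier slides call for.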
![Page 39: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/39.jpg)
Potential Solutions
Geographic Indexing
SimpleGeo

    import simplegeo.shared, simplegeo.places
    from simplegeo.shared import Feature

    client = simplegeo.places.Client('your-oauth-token', 'your-oauth-secret')
    # Attributes of the place to add.
    properties = {
        "province": "CA", "city": "San Francisco", "name": "SimpleGeo SF",
        "country": "US", "phone": "+1 415 626 1375",
        "address": "41 Decatur St", "postcode": "94103",
    }
    f = simplegeo.places.Feature((37.772392, -122.405752), properties=properties)
    client.add_feature(f)
    # Returns a handle such as:
    # 'SG_5uZpvipNjVaSbbDv5bvZaa_37.772392_-122.405752@1291847366'

![Page 40: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/40.jpg)
Potential Solutions
Geographic Indexing
SimpleGeo: Queries

    import simplegeo.places

    def start(lon, lat):
        # The credentials file holds the OAuth token and secret on separate lines.
        oauth, secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
        client = simplegeo.places.Client(oauth, secret)
        results = client.search(lon, lat)
        return results

    def search(lon, lat, tweet):
        results = start(lon, lat)
        # Compare every word in the tweet with the name of every nearby place.
        for word in tweet.split():
            for i in results:
                data = i.to_dict()
                name = data['properties']['name']
                if word == name:
                    print('%s %s' % (name, word))

![Page 41: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/41.jpg)
Potential Solutions: SimpleGeo-Tools

    import simplegeo.places
    import simplegeo.context

    class SimpleGeoAuth(object):
        def __init__(self):
            # The credentials file holds the OAuth token and secret on separate lines.
            self.oauth, self.secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
            self.places_client = simplegeo.places.Client(self.oauth, self.secret)
            self.context_client = simplegeo.context.Client(self.oauth, self.secret)

        def SimpleGeoContextualQuery(self, lat, lon, text):
            # Return the first nearby place whose name matches a word in the text.
            geo_results = self.places_client.search(lat, lon)
            for word in text.split():
                for geo_result in geo_results:
                    data = geo_result.to_dict()
                    name = data['properties']['name']
                    if word == name:
                        return name, word

        def SimpleGeoContextQuery(self, lat, lon):
            context_results = self.context_client.get_context(lat, lon)
            return context_results

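Comparing single words with place names misses any name longer than one word, such as "SimpleGeo SF". The sketch below shows the same contextual lookup with whole-name matching. It makes no API calls: the `PLACES` list is a hypothetical in-memory stand-in for a places search response, the two "Example" entries are invented, and the radius filter uses the haversine formula in place of whatever the service does internally.

```python
import math

# Hypothetical stand-in for a places search result: (name, lat, lon).
PLACES = [
    ("SimpleGeo SF", 37.772392, -122.405752),
    ("Example Cafe", 37.7800, -122.4100),   # invented entry
    ("Example Diner", 40.7500, -73.9900),   # invented entry, far away
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def places_mentioned(lat, lon, text, radius_km=5.0):
    """Names of nearby places that appear anywhere in the text."""
    lowered = text.lower()
    return [name for (name, plat, plon) in PLACES
            if haversine_km(lat, lon, plat, plon) <= radius_km
            and name.lower() in lowered]

print(places_mentioned(37.7749, -122.4194, "lunch at example cafe then simplegeo sf"))
```

Substring matching is deliberately crude and will produce false positives for short names. It is here only to show how location narrows the set of names worth checking.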
![Page 42: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/42.jpg)
Potential Solutions: Connect the APIs
![Page 43: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/43.jpg)
References
Jacob Perkins: NLTK Master Ninja
Python Text Processing with NLTK 2.0 Cookbook
https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
http://streamhacker.com/
Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010). A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October 2010.
Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating Twitter users. In CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management.
![Page 44: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/44.jpg)
References
Repustate: Sentiment Analysis API http://repustate.com/
Rapleaf Personalization API https://www.rapleaf.com/
SimpleGeo GIS Solution API http://simplegeo.com/
![Page 45: The Linguistics of Twitter - PyCon 2011 Presentation](https://reader031.vdocument.in/reader031/viewer/2022013100/54958d6dac7959182e8b4e4e/html5/thumbnails/45.jpg)
Michael D. Healy SimpleGeo-Tools
Michael D. Healy [email protected] http://michaeldhealy.com @MichaelDHealy
SimpleGeo-Tools https://github.com/michaeldhealy/SimpleGeo-Tools