han zhang & jennifer pan - princeton universityhz2/files/polmeth_poster_protest_method.pdf ·...
TRANSCRIPT
CASM: A Deep-Learning Approach for IdentifyingCollective Action Events with Text and Image Data from
Social MediaHan Zhang & Jennifer Pan
Department of Sociology, Princeton & Department of Communication, Stanford
1. Motivation
• Political collective action is a powerful tool tochallenge authoritarian regimes.
• Collecting data on collective action events inauthoritarian regimes is extremely hard:traditional mass media are subject to strongregulation.
• Large-scale protest dataset in authoritarianregimes is scant.
Fig. 1: Protest in Wukan, China, Dec. 2011. One of thelargest political protests after 1989. goo.gl/cQ2QzG
2. CASM: Collective ActionEvents from Social Media
• Goal: identify offline collective action fromsocial media data.
• Why social media: arguably the best sourcewhen news reports and government statisticsare not available.
• Why difficult:• Short texts;• Extremely rare;• Posts about protests and posts about grievances are
very similar (newspapers will pre-screen grievances).
“I just saw aprotest onstreet!”
“Real estate developers isviolating our contracts! Iwant to go out to protest!”
3. CASM: Overview
• Obtain posts with protest-related words.• Apply a two-stage deep learning
classifier, using image and textual data,to identify posts about collective action.
1 First stage identifies post about grievances.2 Second stage identifies post about protests.
• Identify unique collective action eventsfrom collective action posts, usinglocation and time information.
4. CASM-China
• Implement CASM in China: high social mediapenetration and harsh media regulation.
• Last available government statistic: 85,000 protestin 2005.
• Data source: Sina Weibo (similar to Twitter).• Training Data: 60000 protests, hand-curated from
Internet sources by two human-rights lawyers (theWickedonna Collection; goo.gl/Btr2zS).
5. CASM: System
• Input: 9.5M Weibo posts with protest-related words (dictionary size = 50), from 2011 - 2017.• Output: 197,734 unique collective action events.
Posts with Protest-related
Words (n = 9.5M)
Events(n = 197K)
Posts about Grievances(n = 871K)
Posts about Protests
(n = 283K)
Convolutional Neural
Network
Positive Texts: the Wickedonna Collection
Recurrent Neural
Network
Text Classifier Image ClassifierTraining Data: Text
Negative Text 1: Random Posts
Negative Text 2
Human coded subset( n = 40K)
Recurrent Neural
NetworkNegative Images:
Random Posts
Positive ImagesWickedonna
Training Data: ImagesInput
Rule-based Algorithm to Group
Posts under the same Event
7. External Validation with Publicly Available Datasets
• CASM-China identifies many more events than existing datasets (Jan. 1, 2016 to Jun. 30, 2016).• CASM-China covers high proportion of events from existing datasets.• CASM-China misses events in regions with limited internet and Weibo penetration (e.g., Tibet).
Source Time Range # Events Proportion Covered by CASMCASM-China Social media 2010-17 12,662GDELT Int’l newspapers 1979- 299 56%ICEWS Int’l newspapers 1979- 25 52%WiseNews Chinese newspapers 1998- 276 88%Wickedonna Social media 2013-16 11,085 70%China LaborBulletin Mixed 2011- 1,455 75%
6. Internal Validation
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00recall
prec
isio
n
ClassifierOut−of−sample ValidationCross−ValidationRandom Guess
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00recall
prec
isio
n
ClassifierTwo−stage ClassifierOne−stage ClassifierRandom Guess
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00recall
prec
isio
n
ClassifierImageTextCombined
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00recall
prec
isio
n
ClassifierDeep LearningNaive BayesSVM
8. Validating Censorship BiasCensorship does not bias CASM substantially:• Censorship focuses on bursty events, but
censorship rate is not 100%.• Most collective action events are not bursty →
low likelihood of censorship.• Test with pre-censored data: 0.54% collective
action posts later censored.
9. Contributions
• Deep learning using image and text together.• Extensive internal and external validation.• Reveals benefits and limitations of using social
media for event data.• Largest collective action dataset in any
authoritarian regime.
10. Ethical Considerations
• “Dual-use” dilemma: our method and datacould be used by malicious third parties.
• Make only event-level data public.