han zhang & jennifer pan - princeton universityhz2/files/polmeth_poster_protest_method.pdf ·...

CASM: A Deep-Learning Approach for IdentifyingCollective Action Events with Text and Image Data from

Social MediaHan Zhang & Jennifer Pan

Department of Sociology, Princeton & Department of Communication, Stanford

1. Motivation

• Political collective action is a powerful tool tochallenge authoritarian regimes.

• Collecting data on collective action events inauthoritarian regimes is extremely hard:traditional mass media are subject to strongregulation.

• Large-scale protest dataset in authoritarianregimes is scant.

Fig. 1: Protest in Wukan, China, Dec. 2011. One of thelargest political protests after 1989. goo.gl/cQ2QzG

2. CASM: Collective ActionEvents from Social Media

• Goal: identify offline collective action fromsocial media data.

• Why social media: arguably the best sourcewhen news reports and government statisticsare not available.

• Why difficult:• Short texts;• Extremely rare;• Posts about protests and posts about grievances are

very similar (newspapers will pre-screen grievances).

“I just saw aprotest onstreet!”

“Real estate developers isviolating our contracts! Iwant to go out to protest!”

3. CASM: Overview

• Obtain posts with protest-related words.• Apply a two-stage deep learning

classifier, using image and textual data,to identify posts about collective action.

1 First stage identifies post about grievances.2 Second stage identifies post about protests.

• Identify unique collective action eventsfrom collective action posts, usinglocation and time information.

4. CASM-China

• Implement CASM in China: high social mediapenetration and harsh media regulation.

• Last available government statistic: 85,000 protestin 2005.

• Data source: Sina Weibo (similar to Twitter).• Training Data: 60000 protests, hand-curated from

Internet sources by two human-rights lawyers (theWickedonna Collection; goo.gl/Btr2zS).

5. CASM: System

• Input: 9.5M Weibo posts with protest-related words (dictionary size = 50), from 2011 - 2017.• Output: 197,734 unique collective action events.

Posts with Protest-related

Words (n = 9.5M)

Events(n = 197K)

Posts about Grievances(n = 871K)

Posts about Protests

(n = 283K)

Convolutional Neural

Network

Positive Texts: the Wickedonna Collection

Recurrent Neural

Network

Text Classifier Image ClassifierTraining Data: Text

Negative Text 1: Random Posts

Negative Text 2

Human coded subset( n = 40K)

Recurrent Neural

NetworkNegative Images:

Random Posts

Positive ImagesWickedonna

Training Data: ImagesInput

Rule-based Algorithm to Group

Posts under the same Event

7. External Validation with Publicly Available Datasets

• CASM-China identifies many more events than existing datasets (Jan. 1, 2016 to Jun. 30, 2016).• CASM-China covers high proportion of events from existing datasets.• CASM-China misses events in regions with limited internet and Weibo penetration (e.g., Tibet).

Source Time Range # Events Proportion Covered by CASMCASM-China Social media 2010-17 12,662GDELT Int’l newspapers 1979- 299 56%ICEWS Int’l newspapers 1979- 25 52%WiseNews Chinese newspapers 1998- 276 88%Wickedonna Social media 2013-16 11,085 70%China LaborBulletin Mixed 2011- 1,455 75%

6. Internal Validation

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierOut−of−sample ValidationCross−ValidationRandom Guess

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierTwo−stage ClassifierOne−stage ClassifierRandom Guess

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierImageTextCombined

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierDeep LearningNaive BayesSVM

8. Validating Censorship BiasCensorship does not bias CASM substantially:• Censorship focuses on bursty events, but

censorship rate is not 100%.• Most collective action events are not bursty →

low likelihood of censorship.• Test with pre-censored data: 0.54% collective

action posts later censored.

9. Contributions

• Deep learning using image and text together.• Extensive internal and external validation.• Reveals benefits and limitations of using social

media for event data.• Largest collective action dataset in any

authoritarian regime.

10. Ethical Considerations

• “Dual-use” dilemma: our method and datacould be used by malicious third parties.

• Make only event-level data public.

goo.gl/cQ2QzG

goo.gl/Btr2zS

han zhang & jennifer pan - princeton universityhz2/files/polmeth_poster_protest_method.pdf ·...

Documents