han zhang & jennifer pan - princeton universityhz2/files/polmeth_poster_protest_method.pdf ·...

CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and Image Data from Social Media

Han Zhang & Jennifer Pan

Department of Sociology, Princeton & Department of Communication, Stanford

1. Motivation

• Political collective action is a powerful tool to challenge authoritarian regimes.

• Collecting data on collective action events in authoritarian regimes is extremely hard: traditional mass media are subject to strong regulation.

• Large-scale protest datasets for authoritarian regimes are scarce.

Fig. 1: Protest in Wukan, China, Dec. 2011. One of the largest political protests after 1989. goo.gl/cQ2QzG

2. CASM: Collective Action Events from Social Media

• Goal: identify offline collective action from social media data.

• Why social media: arguably the best source when news reports and government statistics are not available.

• Why difficult:
  • Short texts;
  • Extremely rare;
  • Posts about protests and posts about grievances are very similar (newspapers will pre-screen grievances).

“I just saw a protest on the street!”

“Real estate developers are violating our contracts! I want to go out to protest!”

3. CASM: Overview

• Obtain posts with protest-related words.

• Apply a two-stage deep learning classifier, using image and textual data, to identify posts about collective action.
  1. First stage identifies posts about grievances.
  2. Second stage identifies posts about protests.

• Identify unique collective action events from collective action posts, using location and time information.
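The first two steps above can be sketched as a simple cascade: a dictionary filter followed by two scoring stages. This is a minimal illustration of the control flow only. The term list, the threshold, and the two toy scoring functions are hypothetical stand-ins, not the poster's dictionary or its trained deep-learning models.

```python
from typing import Callable, Iterable, List

# Hypothetical dictionary; the poster's real dictionary is not listed.
PROTEST_TERMS = {"protest", "strike", "petition"}


def has_protest_term(post: str) -> bool:
    """Dictionary filter: keep posts containing any protest-related word."""
    text = post.lower()
    return any(term in text for term in PROTEST_TERMS)


def cascade(
    posts: Iterable[str],
    grievance_score: Callable[[str], float],
    protest_score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not from the poster
) -> List[str]:
    """Keep posts that pass the dictionary filter and both classifier stages."""
    kept = []
    for post in posts:
        if not has_protest_term(post):
            continue  # dropped before any classifier runs
        if grievance_score(post) < threshold:
            continue  # stage 1: not about a grievance
        if protest_score(post) < threshold:
            continue  # stage 2: grievance, but no offline protest
        kept.append(post)
    return kept


# Toy scorers standing in for trained models (assumptions for illustration).
def toy_grievance_score(post: str) -> float:
    text = post.lower()
    return 1.0 if ("protest" in text or "contract" in text) else 0.0


def toy_protest_score(post: str) -> float:
    return 1.0 if "saw" in post.lower() else 0.0


posts = [
    "I just saw a protest on the street!",
    "Real estate developers are violating our contracts! I want to go out to protest!",
    "Nice weather today.",
]
kept = cascade(posts, toy_grievance_score, toy_protest_score)
print(kept)
```

In this toy run the third post is removed by the dictionary filter, and the second post passes stage 1 but fails stage 2, which mirrors the grievance-versus-protest distinction the poster describes as difficult.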

4. CASM-China

• Implement CASM in China: high social media penetration and harsh media regulation.

• Last available government statistic: 85,000 protests in 2005.

• Data source: Sina Weibo (similar to Twitter).

• Training Data: 60,000 protests, hand-curated from Internet sources by two human-rights lawyers (the Wickedonna Collection; goo.gl/Btr2zS).

5. CASM: System

• Input: 9.5M Weibo posts with protest-related words (dictionary size = 50), from 2011 to 2017.

• Output: 197,734 unique collective action events.

[Figure: CASM system diagram. Labels recovered from the diagram:
• Data flow: Input; Posts with Protest-related Words (n = 9.5M); Posts about Grievances (n = 871K); Posts about Protests (n = 283K); Events (n = 197K).
• Models: Text Classifier; Image Classifier; Convolutional Neural Network; Recurrent Neural Network (appears twice).
• Training Data: Text. Positive Texts: the Wickedonna Collection; Negative Text 1: Random Posts; Negative Text 2; Human coded subset (n = 40K).
• Training Data: Images. Positive Images: Wickedonna; Negative Images: Random Posts.
• Rule-based Algorithm to Group Posts under the same Event.]
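The poster does not spell out the grouping rule, so the following is only a sketch of one plausible rule-based approach: posts that share a location and fall within a short time gap of each other are merged into one event. The `Post` fields, the place names, the dates, and the one-day gap are all assumptions made for illustration.

```python
from datetime import date
from typing import List, NamedTuple


class Post(NamedTuple):
    post_id: str
    location: str  # hypothetical: a place name extracted from the post
    day: date      # posting date


def group_into_events(posts: List[Post], max_gap_days: int = 1) -> List[List[Post]]:
    """Merge posts with the same location whose dates are at most max_gap_days apart."""
    events: List[List[Post]] = []
    # Sorting by (location, day) puts candidate neighbours next to each other.
    for post in sorted(posts, key=lambda p: (p.location, p.day)):
        if events:
            last = events[-1][-1]
            same_place = last.location == post.location
            close_in_time = (post.day - last.day).days <= max_gap_days
            if same_place and close_in_time:
                events[-1].append(post)
                continue
        events.append([post])
    return events


# Made-up posts: three from one place, one from another.
posts = [
    Post("a", "CityA", date(2016, 1, 10)),
    Post("b", "CityA", date(2016, 1, 11)),
    Post("c", "CityA", date(2016, 1, 20)),
    Post("d", "CityB", date(2016, 1, 10)),
]
events = group_into_events(posts)
event_ids = [[p.post_id for p in event] for event in events]
print(event_ids)
```

Chaining on the most recent post in an event, rather than the first, lets a multi-day event stay together as long as posts keep arriving; a fixed window from the first post would be an equally reasonable alternative.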

7. External Validation with Publicly Available Datasets

• CASM-China identifies many more events than existing datasets (Jan. 1, 2016 to Jun. 30, 2016).

• CASM-China covers a high proportion of events from existing datasets.

• CASM-China misses events in regions with limited internet and Weibo penetration (e.g., Tibet).

Dataset                Source               Time Range   # Events   Proportion Covered by CASM
CASM-China             Social media         2010-17      12,662
GDELT                  Int’l newspapers     1979-        299        56%
ICEWS                  Int’l newspapers     1979-        25         52%
WiseNews               Chinese newspapers   1998-        276        88%
Wickedonna             Social media         2013-16      11,085     70%
China Labor Bulletin   Mixed                2011-        1,455      75%
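The "Proportion Covered by CASM" column can be read as the share of events in an external dataset that have a matching CASM event. The poster does not state the matching rule, so the sketch below assumes a simple one: same location and dates within a small window. The event tuples, place names, and window size are hypothetical.

```python
from datetime import date
from typing import List, Tuple

Event = Tuple[str, date]  # (location, day); an assumed minimal representation


def coverage(external: List[Event], casm: List[Event], window_days: int = 1) -> float:
    """Share of external events matched by at least one CASM event."""
    if not external:
        raise ValueError("external dataset is empty")
    covered = 0
    for loc, day in external:
        if any(
            loc == c_loc and abs((day - c_day).days) <= window_days
            for c_loc, c_day in casm
        ):
            covered += 1
    return covered / len(external)


# Made-up events for illustration only.
external_events = [
    ("CityA", date(2016, 2, 1)),
    ("CityB", date(2016, 3, 5)),
    ("CityC", date(2016, 4, 9)),
    ("CityD", date(2016, 5, 2)),
]
casm_events = [
    ("CityA", date(2016, 2, 1)),
    ("CityB", date(2016, 3, 6)),   # one day off, still inside the window
    ("CityC", date(2016, 4, 9)),
    ("CityE", date(2016, 5, 2)),   # different place, no match
]
share = coverage(external_events, casm_events)
print(f"{share:.0%}")
```

A looser window or coarser location unit raises the measured coverage, so any reported proportion depends on the matching rule chosen.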

6. Internal Validation

[Figure: four panels of classifier performance curves, each with recall (0.00 to 1.00) on the x-axis. Panel legends: (1) Out-of-sample Validation, Cross-Validation, Random Guess; (2) Two-stage Classifier, One-stage Classifier, Random Guess; (3) Image, Text, Combined; (4) Deep Learning, Naive Bayes, SVM.]
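Curves with recall on the x-axis are commonly precision-recall curves, although the y-axis label did not survive extraction, so that reading is an assumption. Under that assumption, each point on a curve comes from one score threshold, as in this minimal sketch with made-up labels and scores.

```python
from typing import List, Tuple


def precision_recall(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when posts scoring >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up data: 1 = post about collective action, 0 = other post.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]

# Lowering the threshold trades precision for recall.
curve = [precision_recall(labels, scores, t) for t in (0.85, 0.5, 0.05)]
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

At the lowest threshold every post is predicted positive, so precision falls to the share of positives in the data, which is also the expected precision of a random-guess baseline.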

8. Validating Censorship Bias

Censorship does not bias CASM substantially:

• Censorship focuses on bursty events, but the censorship rate is not 100%.

• Most collective action events are not bursty → low likelihood of censorship.

• Test with pre-censored data: 0.54% of collective action posts were later censored.

9. Contributions

• Deep learning using image and text together.

• Extensive internal and external validation.

• Reveals benefits and limitations of using social media for event data.

• Largest collective action dataset in any authoritarian regime.

10. Ethical Considerations

• “Dual-use” dilemma: our method and data could be used by malicious third parties.

• Make only event-level data public.
