han zhang & jennifer pan - princeton universityhz2/files/polmeth_poster_protest_method.pdf ·...

1
CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and Image Data from Social Media Han Zhang & Jennifer Pan Department of Sociology, Princeton & Department of Communication, Stanford 1. Motivation Political collective action is a powerful tool to challenge authoritarian regimes. Collecting data on collective action events in authoritarian regimes is extremely hard: traditional mass media are subject to strong regulation. Large-scale protest dataset in authoritarian regimes is scant. Fig. 1: Protest in Wukan, China, Dec. 2011. One of the largest political protests after 1989. goo.gl/cQ2QzG 2. CASM: Collective Action Events from Social Media Goal: identify offline collective action from social media data. Why social media: arguably the best source when news reports and government statistics are not available. Why difficult: Short texts; Extremely rare; Posts about protests and posts about grievances are very similar (newspapers will pre-screen grievances). “I just saw a protest on street!” “Real estate developers is violating our contracts! I want to go out to protest!” 3. CASM: Overview Obtain posts with protest-related words. Apply a two-stage deep learning classifier, using image and textual data, to identify posts about collective action. 1 First stage identifies post about grievances. 2 Second stage identifies post about protests. Identify unique collective action events from collective action posts, using location and time information. 4. CASM-China Implement CASM in China: high social media penetration and harsh media regulation. Last available government statistic: 85,000 protest in 2005. Data source: Sina Weibo (similar to Twitter). Training Data: 60000 protests, hand-curated from Internet sources by two human-rights lawyers (the Wickedonna Collection; goo.gl/Btr2zS). 5. CASM: System Input: 9.5M Weibo posts with protest-related words (dictionary size = 50), from 2011 - 2017. Output: 197,734 unique collective action events. Posts with Protest-related Words (n = 9.5M) Events (n = 197K) Posts about Grievances (n = 871K) Posts about Protests (n = 283K) Convolutional Neural Network Positive Texts: the Wickedonna Collection Recurrent Neural Network Text Classifier Image Classifier Training Data: Text Negative Text 1: Random Posts Negative Text 2 Human coded subset ( n = 40K) Recurrent Neural Network Negative Images: Random Posts Positive Images Wickedonna Training Data: Images Input Rule-based Algorithm to Group Posts under the same Event 7. External Validation with Publicly Available Datasets CASM-China identifies many more events than existing datasets (Jan. 1, 2016 to Jun. 30, 2016). CASM-China covers high proportion of events from existing datasets. CASM-China misses events in regions with limited internet and Weibo penetration (e.g., Tibet). Source Time Range # Events Proportion Covered by CASM CASM-China Social media 2010-17 12,662 GDELT Int’l newspapers 1979- 299 56% ICEWS Int’l newspapers 1979- 25 52% WiseNews Chinese newspapers 1998- 276 88% Wickedonna Social media 2013-16 11,085 70% China Labor Bulletin Mixed 2011- 1,455 75% 6. Internal Validation 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Classifier Out-of-sample Validation Cross-Validation Random Guess 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Classifier Two-stage Classifier One-stage Classifier Random Guess 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Classifier Image Text Combined 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Classifier Deep Learning Naive Bayes SVM 8. Validating Censorship Bias Censorship does not bias CASM substantially: Censorship focuses on bursty events, but censorship rate is not 100%. Most collective action events are not bursty low likelihood of censorship. Test with pre-censored data: 0.54% collective action posts later censored. 9. Contributions Deep learning using image and text together. Extensive internal and external validation. Reveals benefits and limitations of using social media for event data. Largest collective action dataset in any authoritarian regime. 10. Ethical Considerations “Dual-use” dilemma: our method and data could be used by malicious third parties. Make only event-level data public.

Upload: others

Post on 11-Mar-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Han Zhang & Jennifer Pan - Princeton Universityhz2/files/polmeth_poster_protest_method.pdf · CASM: A Deep-Learning Approach for Identifying Collective Action Events with Text and

CASM: A Deep-Learning Approach for IdentifyingCollective Action Events with Text and Image Data from

Social MediaHan Zhang & Jennifer Pan

Department of Sociology, Princeton & Department of Communication, Stanford

1. Motivation

• Political collective action is a powerful tool tochallenge authoritarian regimes.

• Collecting data on collective action events inauthoritarian regimes is extremely hard:traditional mass media are subject to strongregulation.

• Large-scale protest dataset in authoritarianregimes is scant.

Fig. 1: Protest in Wukan, China, Dec. 2011. One of thelargest political protests after 1989. goo.gl/cQ2QzG

2. CASM: Collective ActionEvents from Social Media

• Goal: identify offline collective action fromsocial media data.

• Why social media: arguably the best sourcewhen news reports and government statisticsare not available.

• Why difficult:• Short texts;• Extremely rare;• Posts about protests and posts about grievances are

very similar (newspapers will pre-screen grievances).

“I just saw aprotest onstreet!”

“Real estate developers isviolating our contracts! Iwant to go out to protest!”

3. CASM: Overview

• Obtain posts with protest-related words.• Apply a two-stage deep learning

classifier, using image and textual data,to identify posts about collective action.

1 First stage identifies post about grievances.2 Second stage identifies post about protests.

• Identify unique collective action eventsfrom collective action posts, usinglocation and time information.

4. CASM-China

• Implement CASM in China: high social mediapenetration and harsh media regulation.

• Last available government statistic: 85,000 protestin 2005.

• Data source: Sina Weibo (similar to Twitter).• Training Data: 60000 protests, hand-curated from

Internet sources by two human-rights lawyers (theWickedonna Collection; goo.gl/Btr2zS).

5. CASM: System

• Input: 9.5M Weibo posts with protest-related words (dictionary size = 50), from 2011 - 2017.• Output: 197,734 unique collective action events.

Posts with Protest-related

Words (n = 9.5M)

Events(n = 197K)

Posts about Grievances(n = 871K)

Posts about Protests

(n = 283K)

Convolutional Neural

Network

Positive Texts: the Wickedonna Collection

Recurrent Neural

Network

Text Classifier Image ClassifierTraining Data: Text

Negative Text 1: Random Posts

Negative Text 2

Human coded subset( n = 40K)

Recurrent Neural

NetworkNegative Images:

Random Posts

Positive ImagesWickedonna

Training Data: ImagesInput

Rule-based Algorithm to Group

Posts under the same Event

7. External Validation with Publicly Available Datasets

• CASM-China identifies many more events than existing datasets (Jan. 1, 2016 to Jun. 30, 2016).• CASM-China covers high proportion of events from existing datasets.• CASM-China misses events in regions with limited internet and Weibo penetration (e.g., Tibet).

Source Time Range # Events Proportion Covered by CASMCASM-China Social media 2010-17 12,662GDELT Int’l newspapers 1979- 299 56%ICEWS Int’l newspapers 1979- 25 52%WiseNews Chinese newspapers 1998- 276 88%Wickedonna Social media 2013-16 11,085 70%China LaborBulletin Mixed 2011- 1,455 75%

6. Internal Validation

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierOut−of−sample ValidationCross−ValidationRandom Guess

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierTwo−stage ClassifierOne−stage ClassifierRandom Guess

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierImageTextCombined

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00recall

prec

isio

n

ClassifierDeep LearningNaive BayesSVM

8. Validating Censorship BiasCensorship does not bias CASM substantially:• Censorship focuses on bursty events, but

censorship rate is not 100%.• Most collective action events are not bursty →

low likelihood of censorship.• Test with pre-censored data: 0.54% collective

action posts later censored.

9. Contributions

• Deep learning using image and text together.• Extensive internal and external validation.• Reveals benefits and limitations of using social

media for event data.• Largest collective action dataset in any

authoritarian regime.

10. Ethical Considerations

• “Dual-use” dilemma: our method and datacould be used by malicious third parties.

• Make only event-level data public.