Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams. Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX), Qatar Computing Research Institute, Doha, Qatar. SWDM’15 : WWW’15, May 18th 2015

Upload: muhammad-imran

Post on 25-Jul-2015

Page 1: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams

Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams

Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX)

Qatar Computing Research Institute

Doha, Qatar.

SWDM’15 : WWW’15 May 18th 2015

Page 2

Information Variability on Social Media

• Different events present different information categories

• Even for recurring events, the proportions of the categories change

Page 7

Different Classification Approaches

• Various classification approaches exist:
  – Manual classification by human experts
  – Automatic classification using unsupervised or supervised approaches (needs training data)
  – Hybrid: Automatic + Manual

• Retrospective vs. real-time classification
  – Batch processing (offline, training data availability)
  – Stream processing (real-time, scarce training data)

Page 8

Real-time Stream Classification (Supervised)

• Fewer categories are better
  – Decreased worker dropout
  – More training data for each category, hence more accuracy
  – “7 plus/minus 2” rule [G. A. Miller, 56]

• Categories need to be defined carefully
  – Empty categories waste space and workers’ effort
  – Categories that are too large introduce heterogeneity

Page 9

Problem Statement

• How can we classify items arriving as a data stream into a small number of categories, if we cannot anticipate exactly which will be the most frequent categories?

Our research improves crowdsourcing-based and supervised learning-based systems (e.g. AIDR) by finding latent categories in fast data streams.

Page 10

Our Approach (top-down + bottom-up)

1. An expert defines information categories (top-down)
2. Messages are categorized into the initial set plus an extra “Miscellaneous” category
3. Identify relevant and prevalent categories from the messages in the “Miscellaneous” category (bottom-up)

How do we identify relevant categories?

1. Generate candidate categories
2. Learn characteristics of good categories
3. Rank categories on good characteristics

Page 11

Candidate Generation

We propose to apply Latent Dirichlet Allocation (LDA) to the Miscellaneous category:

• Input: A set of n documents (all messages in the Misc. category) and a number m (# of topics to be generated)

• Output: An n x m matrix in which cell (i, j) indicates the extent to which document i corresponds to topic j.
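
The input/output contract above can be illustrated with a minimal sketch of LDA using collapsed Gibbs sampling in plain Python. The sampler, the hyperparameters `alpha` and `beta`, the iteration count, and the toy messages are all assumptions made for illustration; the slides do not state which LDA implementation or settings were used.

```python
import random

def lda_gibbs(docs, m, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of n tokenized documents; m: number of topics.
    Returns an n x m matrix whose cell (i, j) estimates how much
    document i corresponds to topic j (each row sums to 1).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wid = {w: i for i, w in enumerate(vocab)}
    n, V = len(docs), len(vocab)
    ndk = [[0] * m for _ in range(n)]  # document-topic counts
    nkw = [[0] * V for _ in range(m)]  # topic-word counts
    nk = [0] * m                       # total tokens per topic
    z = []                             # topic assignment of every token
    for i, d in enumerate(docs):       # random initialization
        zi = []
        for w in d:
            k = rng.randrange(m)
            zi.append(k)
            ndk[i][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zi)
    for _ in range(iters):
        for i, d in enumerate(docs):
            for t, w in enumerate(d):
                k, v = z[i][t], wid[w]
                # Remove the token, then resample its topic.
                ndk[i][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[i][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                    for j in range(m)
                ]
                k = rng.choices(range(m), weights=weights)[0]
                z[i][t] = k
                ndk[i][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    return [
        [(ndk[i][j] + alpha) / (len(docs[i]) + m * alpha) for j in range(m)]
        for i in range(n)
    ]

# Hypothetical messages standing in for the "Miscellaneous" category.
docs = [
    "school closed tomorrow".split(),
    "schools closed due to flood".split(),
    "airport flights cancelled".split(),
    "flights delayed at airport".split(),
]
theta = lda_gibbs(docs, m=2)  # n = 4 documents, m = 2 topics
```

In practice an off-the-shelf LDA implementation would likely replace this sampler; the point here is only the shape and meaning of the n x m output.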

Page 12

Candidate Evaluation

To reduce the workload of experts deciding which categories to pick, we propose the following criteria:

• Volume: a category shouldn’t be too small

• Novelty: a category must not overlap or be too similar to the existing categories

• Cohesiveness (intra- and inter-similarity): a category should be cohesive (should have small intra-topic and large inter-topic values)
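
One way to make these criteria concrete is sketched below. It assumes that messages are bags of words, that the "intra-topic" and "inter-topic" values are average cosine distances, and that novelty is the distance to the closest existing category; none of these definitions come from the slides, and all function names and example messages are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def centroid(messages):
    """Summed word counts of a group of messages."""
    total = Counter()
    for msg in messages:
        total.update(msg)
    return total

def volume(topic, total_messages):
    """Fraction of all Miscellaneous messages that fall in this topic."""
    return len(topic) / total_messages

def novelty(topic, existing_categories):
    """Distance from the topic to its closest existing category."""
    c = centroid(topic)
    return min(cosine_distance(c, centroid(cat))
               for cat in existing_categories.values())

def intra(topic):
    """Average pairwise distance inside the topic (small = cohesive)."""
    pairs = list(combinations(topic, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

def inter(topic, other_topics):
    """Average distance to messages of the other topics (large = distinct)."""
    dists = [cosine_distance(a, b)
             for other in other_topics for a in topic for b in other]
    return sum(dists) / len(dists) if dists else 0.0

# Hypothetical candidate topics and one existing category.
topic_a = [Counter("school closed tomorrow".split()),
           Counter("school closed today".split())]
topic_b = [Counter("flight cancelled airport".split()),
           Counter("airport flight delayed".split())]
existing = {"donations": [Counter("donate food shelter".split())]}

vol_a = volume(topic_a, total_messages=4)
nov_a = novelty(topic_a, existing)
intra_a = intra(topic_a)
inter_a = inter(topic_a, [topic_b])
```

Under these assumptions, a good candidate would show a reasonable volume, a high novelty, a small `intra` value and a large `inter` value.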

Page 13

Experimental Testing

• We used Twitter data of 17 crises (from the CrisisLexT26 dataset at crisislex.org)

A. Affected individuals, deaths, injuries, missing, found.

B. Infrastructure and utilities: buildings, roads, services damage.

C. Donation and volunteering: needs, requests of food, shelter, supplies.

D. Caution and advice: warnings issued or lifted, guidance and tips.

E. Sympathy and emotional support: thoughts, prayers, gratitude, etc.

Z. Other useful information not covered by any of the above categories.

Page 14

Candidate Generation Setup

• Applied LDA on the messages in the “Z” category of each crisis

• 5 topics were generated for each crisis

• Considered messages with LDA score > 0.06 in each topic

• Presented the LDA generated topics to experts in random order
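
The score threshold in this setup can be sketched as a simple filter over the document-topic matrix. The function name and the example matrix are hypothetical; only the 0.06 cutoff comes from the slide, and whether a message may appear under several topics is an assumption of this sketch.

```python
def messages_per_topic(theta, threshold=0.06):
    """For each topic j, list the indices of messages whose score exceeds the threshold.

    theta is an n x m document-topic matrix. A message can be listed
    under more than one topic (assumption).
    """
    m = len(theta[0]) if theta else 0
    return [
        [i for i, row in enumerate(theta) if row[j] > threshold]
        for j in range(m)
    ]

# Hypothetical scores for 3 messages and 3 topics.
theta_example = [
    [0.90, 0.05, 0.05],
    [0.50, 0.44, 0.06],
    [0.02, 0.08, 0.90],
]
selected = messages_per_topic(theta_example)
print(selected)  # [[0, 1], [1, 2], [2]]
```

Note that the comparison is strict, so a score of exactly 0.06 does not qualify.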

Page 15

Candidate Annotation Setup

Recruited two experts from two international humanitarian organizations in the crisis response domain

Page 16

Results

• Topics with avg. score <= 2.5 are considered bad topics

• Topics with avg. score >= 3.5 are considered good topics

• Hit: if the metric value of good topics > bad topics

A crisis is excluded from the evaluation if all of its topics receive an average score below 3.0, or all receive an average score above 3.0.
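
The rules above can be sketched as follows. The thresholds (2.5, 3.5, 3.0) come from the slide; comparing the mean metric value of good topics against the mean of bad topics is an assumption, since the slide does not say how the values are aggregated. All names and numbers in the example are hypothetical.

```python
from statistics import mean

def split_topics(avg_scores):
    """Split topic ids into good (>= 3.5) and bad (<= 2.5) by average expert score."""
    good = [t for t, s in avg_scores.items() if s >= 3.5]
    bad = [t for t, s in avg_scores.items() if s <= 2.5]
    return good, bad

def is_evaluable(avg_scores):
    """A crisis is skipped when all its topics score below 3.0 or all above 3.0."""
    scores = list(avg_scores.values())
    return not (all(s < 3.0 for s in scores) or all(s > 3.0 for s in scores))

def is_hit(metric, avg_scores):
    """True if the metric is higher on average for good topics than for bad ones.

    Returns None when the crisis has no good or no bad topics to compare.
    """
    good, bad = split_topics(avg_scores)
    if not good or not bad:
        return None
    return mean(metric[t] for t in good) > mean(metric[t] for t in bad)

# Hypothetical average expert scores and novelty values for one crisis.
avg_scores = {"t1": 4.0, "t2": 2.0, "t3": 3.0}
novelty_values = {"t1": 0.9, "t2": 0.4, "t3": 0.6}

evaluable = is_evaluable(avg_scores)
hit = is_hit(novelty_values, avg_scores)
```

For a metric where smaller is better (such as an intra-topic distance), the comparison would presumably be reversed.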

Page 17

Conclusion

• Novelty, intra-similarity and cohesiveness are useful in identifying good topics

• Our approach combines top-down (manual) and bottom-up (automatic) elements.

• Learned important characteristics of good topics

• Future work includes candidate ranking, including recommendations for adding, merging, or dropping new, unseen categories

Page 18

Data used in this study can be requested.

Contact: Muhammad Imran at [email protected] or @mimran15

Page 19

Thank you!

Authors contact:

Muhammad Imran @mimran15

Carlos Castillo @ChaToX