balancing diversity to counter-measure geographical centralization in microblogging platforms

Post on 01-Dec-2014

DESCRIPTION

Presented at Hypertext 2014.

TRANSCRIPT

Balancing Diversity to Counter-measure Geographical Centralization in Microblogging Platforms

Eduardo Graells-Garrido, Web Research Group, Universitat Pompeu Fabra, Barcelona, Spain

Mounia Lalmas, Yahoo Labs, London, UK

Hypertext, Sept. 4, 2014, Santiago, Chile

Motivation: Geographical Centralization

Every person behaves in a biased way (homophily, selective exposure, etc.) in both physical and virtual worlds.

Does the same happen with systematic biases?

Chile is a centralized country - public policy, population migration and media are biased towards its capital. This is increasing the population imbalance, and vice versa!

Some Effects of Geographical Centralization

This affects Web users as content is not geographically diverse (mostly related to/from Santiago). Content from other locations is hidden and hard to find.

(I was at WWW when I searched for this. “Everywhere” displays relevant tweets from Santiago only.)

Problem Statement

Detect and Measure Geographical Centralization

Is centralization reflected on micro-blogging platforms?

Tweet Classification into Locations

How to find tweets from other locations in imbalanced contexts?

[Rout et al, HT 2013] studied geolocation in imbalanced populations from a network perspective. We follow a similar approach from a content perspective.

Information Filtering - Geo. Diverse Timeline

How to build a geographically diverse timeline?

We build upon prior work on information diversity filtering: [De Choudhury et al, HT 2011] and [Munson et al, ICWSM 2009].

Case Study: Chile, Municipal Elections 2012

Is Geographical Centralization Reflected on Twitter?

Frequent Terms

Dataset: #municipales2012

Locally Important: denser network discussions; local vocabulary (classification)

National Level: interactions between locations

Query Keywords: hashtags, tenses of to-vote, candidate names, political institutions, locations

Using self-reported location, 27.95% of users are geolocated at the regional level. They published 42.15% of the tweets in the dataset.

Ideal characteristics, but there is a need to classify tweets.

Physical and Virtual Population Distributions

We consider the sample geographically representative.

r = 0.95, p < 0.01. Source: Census 2012*

r = 0.68, p < 0.01. Source: CASEN Survey

Imbalanced Population (Different Orders of Magnitude)

Balanced Representation (Equal Orders of Magnitude)

Is the Chilean Virtual Population in Twitter centralized towards the capital Metropolitan Region?

Interactions Between Locations

Adjacency Matrix of 1-way interactions. [Quercia et al, 2012]

M(i,j) = mentions(Li, Lj) + retweets(Li, Lj)

Each arc in the visualization represents an M(i,j). Li is on the left, Lj on the right.

Green edges indicate i = j. Brown edges indicate j = Santiago (RM). The rest are gray.
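The definition M(i,j) = mentions(Li, Lj) + retweets(Li, Lj) can be sketched as a simple counting pass. This is a minimal illustration, not the authors' code: the record format (source user, target user, kind), the helper name interaction_matrix, and the toy data are all assumptions made here.

```python
from collections import Counter

def interaction_matrix(interactions, user_location):
    """Count 1-way interactions between locations.

    interactions: iterable of (source_user, target_user, kind) tuples, where
        kind is "mention" or "retweet" (hypothetical record format).
    user_location: dict mapping a user to a location label.
    Returns a Counter keyed by (Li, Lj), i.e. a sparse adjacency matrix M.
    """
    M = Counter()
    for source, target, kind in interactions:
        if kind not in ("mention", "retweet"):
            continue
        li = user_location.get(source)
        lj = user_location.get(target)
        if li is None or lj is None:
            continue  # skip users whose location is unknown
        M[(li, lj)] += 1  # mentions and retweets are summed, as in M(i,j)
    return M

# Toy data, invented for illustration only.
user_location = {"ana": "Valparaiso", "beto": "RM", "caro": "RM", "dani": "Biobio"}
interactions = [
    ("ana", "beto", "mention"),
    ("ana", "caro", "retweet"),
    ("beto", "caro", "mention"),
    ("dani", "beto", "retweet"),
    ("dani", "ana", "mention"),
    ("dani", "zoe", "mention"),  # zoe has no known location: ignored
]
M = interaction_matrix(interactions, user_location)
```

In this toy example most arcs end in RM, which is the kind of pattern the brown edges in the visualization highlight.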

Geographical Centralization

We interpret the extreme differences between observed and expected centrality as geographical centralization towards Santiago (Metropolitan Region).

Observed Centrality: estimated from a graph based on M.

Expected Centrality: estimated from a graph with edge weights based on location populations.
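One way to make the observed/expected comparison concrete is sketched here under strong assumptions. The slides do not state which centrality measure was used or how the population-based edge weights were defined. In this sketch, centrality is taken to be the share of total edge weight arriving at a location (in-strength), and the expected weight of an edge is the product of the two locations' populations. Both choices, the function names, and the toy numbers are illustrative only.

```python
def in_strength_share(weights):
    """Share of the total edge weight that points at each location.

    weights: dict mapping (Li, Lj) to a non-negative edge weight.
    This is an assumed stand-in for the centrality measure.
    """
    total = sum(weights.values())
    share = {}
    for (_, lj), w in weights.items():
        share[lj] = share.get(lj, 0.0) + w / total
    return share

def expected_weights(population):
    """Assumed null model: edge weight proportional to pop(Li) * pop(Lj)."""
    return {(li, lj): population[li] * population[lj]
            for li in population for lj in population}

# Toy numbers, invented for illustration only.
observed_M = {("A", "RM"): 6, ("B", "RM"): 2, ("RM", "RM"): 10,
              ("RM", "A"): 1, ("A", "B"): 1}
population = {"RM": 6, "A": 3, "B": 1}

observed = in_strength_share(observed_M)
expected = in_strength_share(expected_weights(population))

# A ratio above 1 means a location attracts more attention than its
# population alone would suggest under this null model.
ratio = {loc: observed.get(loc, 0.0) / expected[loc] for loc in population}
```

With these toy values RM receives 90% of the interaction weight against an expected 60%, so its ratio is about 1.5, while the other locations fall below 1.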

How to make timelines more Geographically Diverse?

Shannon Entropy with respect to geography
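As a reminder of the measure named above, Shannon entropy over the locations in a timeline is H = -sum over locations l of p(l) * log p(l), where p(l) is the fraction of tweets from location l. A minimal sketch follows; the function name and the base-2 logarithm are choices made here, not taken from the slides.

```python
import math
from collections import Counter

def geographic_entropy(locations):
    """Shannon entropy (in bits) of the location labels of a timeline.

    locations: list with one location label per tweet.
    Returns 0 when every tweet comes from the same location, and
    log2(k) when tweets are spread evenly over k locations.
    """
    counts = Counter(locations)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy timelines, invented for illustration only.
centralized = ["RM", "RM", "RM", "RM"]
diverse = ["RM", "Valparaiso", "Biobio", "Antofagasta"]
h_centralized = geographic_entropy(centralized)
h_diverse = geographic_entropy(diverse)
```

A timeline dominated by one location scores near zero, so higher values indicate a more geographically diverse timeline.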

First: Classifying Tweets into Locations with Diversity

We built a corpus of location documents. For classification, we represent a tweet as a vector of its cosine similarities to each location document, with terms weighted using TF-IDF. We evaluate with 10-fold cross-validation.
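The similarity features can be sketched as follows: each tweet becomes a short vector with one cosine similarity per location document. This is an illustration under assumptions, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the helper names, and the toy location documents are all invented here, and the downstream classifier and the 10-fold cross-validation are not shown.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()  # naive tokenizer, assumed for brevity

def build_idf(documents):
    """Smoothed inverse document frequency over the location documents."""
    n = len(documents)
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_features(tweet, location_vectors, idf, order):
    """One cosine similarity per location, in a fixed location order."""
    tweet_vector = tfidf(tokenize(tweet), idf)
    return [cosine(tweet_vector, location_vectors[loc]) for loc in order]

# Toy location documents, invented for illustration only.
documents = {
    "RM": tokenize("santiago metro providencia alcalde"),
    "Valparaiso": tokenize("valparaiso puerto cerro alcalde"),
    "Biobio": tokenize("concepcion biobio puente alcalde"),
}
order = sorted(documents)  # ["Biobio", "RM", "Valparaiso"]
idf = build_idf(documents)
location_vectors = {loc: tfidf(tokens, idf) for loc, tokens in documents.items()}

features = similarity_features("votando en el cerro de valparaiso",
                               location_vectors, idf, order)
```

The feature vector has as many dimensions as there are locations, regardless of vocabulary size, which is one plausible reason such features are less dominated by the vocabulary of the largest location than bag-of-words features.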

Similarity features provide more geographical diversity (otherwise lost because of population imbalance) and are overall more accurate than bag-of-words approaches.

Similarity Features

BOW Features

We iteratively add tweets to a timeline T. Each added tweet maximizes T’s information entropy [De Choudhury et al, HT 2011], but we enforce geographical diversity of those additions [Munson et al, ICWSM 2009].
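The iterative procedure can be sketched as a greedy loop. The slides do not give the exact entropy definition or the rule used to enforce geographical diversity, so both are assumptions here: entropy is computed over the term counts of the timeline, and each step may only pick from the locations that are currently least represented. The function names and the toy tweets are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a Counter of term frequencies."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def build_timeline(candidates, k):
    """Greedily select up to k tweets.

    candidates: list of (tweet_id, location, tokens) tuples.
    """
    timeline = []
    term_counts = Counter()
    loc_counts = Counter()
    remaining = list(candidates)
    while remaining and len(timeline) < k:
        # Geographic constraint (assumed rule): only consider tweets from
        # the locations that are least represented in the timeline so far.
        fewest = min(loc_counts[loc] for _, loc, _ in remaining)
        pool = [c for c in remaining if loc_counts[c[1]] == fewest]
        # Information criterion: choose the tweet that maximizes the
        # entropy of the timeline's terms after adding it.
        best = max(pool, key=lambda c: entropy(term_counts + Counter(c[2])))
        timeline.append(best)
        term_counts.update(best[2])
        loc_counts[best[1]] += 1
        remaining.remove(best)
    return timeline

# Toy candidates, invented for illustration only.
candidates = [
    ("t1", "RM", ["voto", "alcalde"]),
    ("t2", "RM", ["voto", "santiago"]),
    ("t3", "RM", ["alcalde", "santiago"]),
    ("t4", "Valparaiso", ["voto", "puerto"]),
    ("t5", "Biobio", ["puente", "resultado"]),
]
timeline = build_timeline(candidates, k=3)
selected_ids = [tweet_id for tweet_id, _, _ in timeline]
selected_locations = [loc for _, loc, _ in timeline]
```

Although three of the five toy candidates come from RM, the constraint leads the three-tweet timeline to cover three different locations.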

Second: Filtering Tweets to build a Geo. Diverse Timeline

Empirical Observations

Election results start to appear! Unexpected results in some locations! Discussion becomes a bit more global. In all cases, geographical diversity exists.

The Proposed Method (PM) is more geographically diverse than the baselines:
DIV: [De Choudhury et al, HT 2011]
POP: top-k popular tweets

In terms of social voting, PM has more representation of popular tweets than DIV.

Overview of Results

Is centralization reflected on micro-blogging platforms?
Yes! As with other behavioral biases (homophily, selective exposure), the systematic bias of geographical centralization is also present and is measurable.

How to find tweets from other locations?
Consider imbalance-aware features, such as content similarity metrics. This improves diversity of classifications without losing accuracy.

How to build a geographically diverse timeline?
A correct mixture of known techniques can have the desired effects without trade-offs! (gained representation of popularity, did not lose info. diversity)
In contrast to sensitive contexts where selective exposure is crucial, geographical diversity is less likely to generate cognitive dissonance.

Future Work

User Evaluation: is geographical diversity interesting?

Visualization and User Interfaces: is geographical diversity engaging?

Questions?

Thanks for attending!

Contact: @carnby

http://carnby.github.io

Special Thanks: Dany Passarinho, Bárbara Poblete, Diego Sáez-Trumper and Anonymous Reviewers

This work was partially funded by Grant TIN2012-38741 (Understanding Social Media: An Integrated Data Mining Approach) of the Ministry of Economy and Competitiveness of Spain.

https://www.flickr.com/photos/malikaladak/8868491759
https://www.flickr.com/photos/28047774@N04/6312764345
https://www.flickr.com/photos/iron_horses/6274365371
https://www.flickr.com/photos/efimeravulgata/1429969601

Additional Data :)
