a topic analysis approach to revealing discussions on the australian twittersphere

Post on 15-Jan-2017

A TOPIC ANALYSIS APPROACH TO REVEALING DISCUSSIONS ON THE AUSTRALIAN TWITTERSPHERE

Brenda Moon, Queensland University of Technology

Introduction

This paper investigates techniques for identifying the topics discussed in one week of tweets from the Australian Twittersphere. Tweets were extracted from a comprehensive dataset that captures all tweets by 2.8m Australian accounts: the Tracking Infrastructure for Social Media Analysis (TrISMA) (Bruns, Burgess & Banks et al., 2016).

Selected week: Sunday 2 August to Saturday 8 August 2015

• Thursday 6th August 2015 was used for One Day in the Life of a National Twittersphere (Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016)

• Same day used for initial development of topic modelling approach

• Then extended to full week

Latent Dirichlet Allocation

Blei, D. M. (2011)

Data cleaning

• Remove
  – retweets & multitweets (“rt”, “mt” or “via”)
  – URLs
  – dates, times, distances & weights
  – words shorter than 3 characters
  – ellipses (“...”)

• NLTK tokenisation using Twitter Tokenizer
  – Remove all @users and URLs
  – Lowercase

• Convert
  – HTML entities to text
  – Hashtags to words (trim ‘#’ off hashtags)

• NLTK lemmatisation

• NLTK stopwords
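The regex-based part of these cleaning steps can be sketched with the Python standard library alone. This is an illustrative sketch, not the code used in the study: the patterns, the function names `is_retweet` and `clean_tweet`, and the choice to drop any tweet containing “rt”, “mt” or “via” as a standalone word are all assumptions. The NLTK steps (Twitter tokenisation, lemmatisation, stopwords) and the removal of dates, times, distances and weights are left out, and a simple regex stands in for the tokeniser.

```python
import html
import re

# Hypothetical patterns; the study's actual expressions are not given on the slide.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\b(rt|mt|via)\b", re.IGNORECASE)
ELLIPSIS_RE = re.compile(r"\.{3,}|…")
TOKEN_RE = re.compile(r"[a-z0-9']+")  # crude stand-in for a Twitter-aware tokeniser


def is_retweet(text):
    """Return True if the text contains 'rt', 'mt' or 'via' as a standalone word."""
    return RETWEET_RE.search(text) is not None


def clean_tweet(text):
    """Return a list of cleaned tokens, or None if the tweet is dropped as a retweet."""
    text = html.unescape(text)           # HTML entities to text
    text = URL_RE.sub(" ", text)         # remove URLs
    if is_retweet(text):                 # assumption: drop the whole tweet
        return None
    text = MENTION_RE.sub(" ", text)     # remove @users
    text = ELLIPSIS_RE.sub(" ", text)    # remove ellipses
    text = text.replace("#", "")         # hashtags to words
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) >= 3]  # drop words shorter than 3 characters
```

For example, `clean_tweet("RT @abc: hello")` returns `None`, while a tweet with a hashtag, a mention and a link is reduced to its lowercase words.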

Hashtag pooling

• Mehrotra, Sanner, Buntine & Xie (2013) looked at different options of ‘pooling’ tweets into documents before LDA analysis to see if this could increase accuracy. They found that hashtag pooling was effective (best was hashtag pooling with clustering, but more complex to apply)

• Group all the tweets with hashtags into documents for each hashtag (some tweets will be added into more than one document)

• Tweets without hashtags stay as individual documents
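A minimal sketch of this pooling scheme follows. The function name `pool_by_hashtag`, the hashtag regex and the case-insensitive matching of hashtags are assumptions made for illustration; the sketch also works on raw tweet text, so hashtags have to be read before the cleaning step trims the ‘#’.

```python
import re
from collections import defaultdict

HASHTAG_RE = re.compile(r"#(\w+)")  # hypothetical pattern for hashtags


def pool_by_hashtag(tweets):
    """Group tweet texts into documents: one per hashtag, plus one per untagged tweet.

    Returns a list of documents, each a list of tweet texts.
    """
    pools = defaultdict(list)
    singles = []
    for text in tweets:
        # Assumption: hashtags are matched case-insensitively.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            # A tweet with several hashtags is added to several documents.
            for tag in tags:
                pools[tag].append(text)
        else:
            # Tweets without hashtags stay as individual documents.
            singles.append([text])
    return [pools[tag] for tag in sorted(pools)] + singles
```

Because a tweet with two hashtags lands in two pools, the number of tweet copies across all documents can exceed the number of input tweets, while the number of documents is usually smaller.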

Corpus filtering (Thursday 6 August 2015)

• Raw tweets: 963,064

• After data cleaning: 583,528

• After hashtag pooling: 516,263
  – 23% of tweets had hashtags

• Dictionary pruning – remove most frequent and least frequent terms
  – no_above=0.5 (fraction of documents, i.e. 50%), no_below=5 (documents)
  – 223,157 unique tokens reduced to 49,964 unique tokens
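The effect of the two pruning thresholds can be illustrated with a small standard-library sketch. It mirrors the documented meaning of the `no_below` and `no_above` parameters (a token is kept if it occurs in at least `no_below` documents and in no more than a `no_above` fraction of all documents); the function name `prune_vocabulary` is hypothetical and this is not Gensim's own implementation.

```python
from collections import Counter


def prune_vocabulary(documents, no_below=5, no_above=0.5):
    """Drop tokens that are too rare or too common, judged by document frequency.

    documents: list of token lists.
    Returns (kept_tokens, pruned_documents).
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document

    max_docs = no_above * len(documents)  # no_above is a fraction, not a count
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    pruned = [[tok for tok in tokens if tok in keep] for tokens in documents]
    return keep, pruned
```

With the slide's settings, a term must appear in at least 5 documents and in at most half of them to survive.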

Latent Dirichlet Allocation (LDA)

• Gensim LDA (Lau & Baldwin, 2014)

• LdaMulticore

• Identify 30 topics

• 100 passes
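A hedged sketch of how these settings might be passed to Gensim is shown below. Only the class name, the topic count, the number of passes and the pruning thresholds come from the slides; the function name `train_lda`, the `workers` argument and the overall wiring are assumptions, and every other parameter is left at the library default. Gensim is imported inside the function so that the settings can be inspected without the library installed.

```python
# Values taken from the slides.
LDA_SETTINGS = {"num_topics": 30, "passes": 100}
DICT_SETTINGS = {"no_below": 5, "no_above": 0.5}


def train_lda(documents, workers=None):
    """Train an LDA model on a list of token lists (hypothetical helper).

    Returns (model, dictionary, corpus).
    """
    # Imported lazily; requires the third-party gensim package.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(documents)
    dictionary.filter_extremes(**DICT_SETTINGS)  # dictionary pruning
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    model = LdaMulticore(
        corpus,
        id2word=dictionary,
        workers=workers,  # None lets gensim choose the number of worker processes
        **LDA_SETTINGS,
    )
    return model, dictionary, corpus
```

On a corpus of this size, 100 passes is a long-running job; a smaller number of passes on a sample is a reasonable way to check the pipeline first.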

Thursday 6th August 2015 – overall terms

https://github.com/bmabey/pyLDAvis

Thursday 6th August 2015 – Topic 2: Politics / coal / China / Queensland

Thursday 6th August 2015 – Topic 5: Cricket – The Ashes

Thursday 6th August 2015 – 30 topics, with hashtag pooling

[Figure annotation: MH370]

Thursday 6th August 2015 – 30 topics, with hashtag pooling: comparison to other study

[Figure annotations: Pop?, Teen culture?, MH370]

1.1m tweets from 147k, to 224k accounts

294k nodes total, including non-Australians

535k edges from 856k @mentions / RTs

Visualisation: Gephi, Force Atlas 2

Colours: Gephi, modularity resolution 1.0

Labels assigned through qualitative evaluation

Cluster labels: Politics, Cricket, Teen Culture, Pop

From “One Day in the Life of a National Twittersphere” by Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016.

Further Outlook

• Confirm initial topic labelling by looking at top tweets for each topic

• Check whether the hashtag pooling has allowed non-hashtag tweet topics to still be visible

• Use statistical coherence of model (U_Mass coherence, C_V coherence) to tune LDA parameters

• Model different numbers of topics (coarse/fine grain)

• Relate topics per user back to our mention network graphs

• Extend to the full week (or longer)

• Compare to alternative approaches
  – Doc2Vec / TensorFlow / dynamic LDA etc.

References

• Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM, 1–16. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

• Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892. http://doi.org/10.1145/2484028.2484166

• Lau, J. H., & Baldwin, T. (2014). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation.

• Puschmann, C., & Scheffler, T. (2016). Topic modeling for media and communication research: A short primer (HIIG Discussion Paper Series No. 2016–5). Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2836478

• Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3110
