A TOPIC ANALYSIS APPROACH TO REVEALING DISCUSSIONS ON THE AUSTRALIAN TWITTERSPHERE

Brenda Moon, Queensland University of Technology

Introduction

This paper investigates techniques for identifying the topics discussed in one week of tweets from the Australian Twittersphere. Tweets were extracted from a comprehensive dataset that captures all tweets by 2.8m Australian accounts: the Tracking Infrastructure for Social Media Analysis (TrISMA) (Bruns, Burgess & Banks et al., 2016).

Selected week: Sunday 2 August to Saturday 8 August 2015

• Thursday 6th August 2015 was used for One Day in the Life of a National Twittersphere (Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016)

• Same day used for initial development of topic modelling approach

• Then extended to full week

Latent Dirichlet Allocation

Blei, D. M. (2011)

Data cleaning

• Remove
  – retweets & multitweets (“rt”, “mt” or “via”)
  – URLs
  – dates, times, distances & weights
  – words shorter than 3 characters
  – ellipses (‘...’)
• NLTK tokenisation using the Twitter tokenizer
  – Remove all @users and URLs
  – Lowercase
• Convert
  – HTML entities to text
  – hashtags to words (trim ‘#’ off hashtags)
• NLTK lemmatisation
• NLTK stopwords
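The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pipeline used for the slides: the regular expressions, the tiny `STOPWORDS` set and the `clean_tweet` helper are all assumptions standing in for the NLTK tokenizer and stopword list, and lemmatisation is left out entirely.

```python
import html
import re

# Tiny illustrative stopword set (assumption); the slides used NLTK's stopwords.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}

RETWEET_RE = re.compile(r"\b(rt|mt|via)\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
ELLIPSIS_RE = re.compile(r"\.{2,}|…")
# Rough stand-in for "dates, times, distances & weights":
# drop any token that starts with a digit (e.g. "7:30pm", "5km", "06/08/2015").
NUMERIC_RE = re.compile(r"\b\d[\w:./-]*")
TOKEN_RE = re.compile(r"[a-z][a-z']*")


def clean_tweet(text):
    """Return a list of cleaned tokens, or None for a retweet/multitweet."""
    text = html.unescape(text)          # HTML entities to text
    if RETWEET_RE.search(text):         # "rt", "mt" or "via" marks a repost
        return None
    text = URL_RE.sub(" ", text)        # remove URLs
    text = MENTION_RE.sub(" ", text)    # remove @users
    text = ELLIPSIS_RE.sub(" ", text)   # remove ellipses
    text = NUMERIC_RE.sub(" ", text)    # remove dates, times, distances, weights
    text = text.replace("#", "")        # hashtags to words
    tokens = TOKEN_RE.findall(text.lower())
    # Drop words shorter than 3 characters and stopwords.
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]


print(clean_tweet("Watching the #Ashes with @mate at 7:30pm... &amp; loving it http://t.co/xyz"))
```

Returning `None` for retweets keeps the "remove" decision separate from tokenisation, so the caller can count how many tweets were dropped at this stage.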

Hashtag pooling

• Mehrotra, Sanner, Buntine & Xie (2013) compared different options for ‘pooling’ tweets into documents before LDA analysis to see whether this could increase accuracy. They found that hashtag pooling was effective (hashtag pooling with clustering performed best, but is more complex to apply)

• Group all the tweets with hashtags into documents for each hashtag (some tweets will be added into more than one document)

• Tweets without hashtags stay as individual documents
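The pooling rule described above can be sketched as follows. This is an illustrative sketch only: the `pool_by_hashtag` helper and its regular expression are hypothetical names, and it assumes hashtags are read from the tweet text before the ‘#’ symbols are stripped during cleaning.

```python
import re
from collections import defaultdict

HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Group tweet texts into documents, one document per hashtag.

    A tweet with several hashtags is added to every matching pool.
    Tweets without hashtags stay as single-tweet documents.
    """
    pools = defaultdict(list)
    singles = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            for tag in tags:
                pools[tag].append(text)
        else:
            singles.append([text])
    # Hashtag pools first (sorted by hashtag for a stable order), then singles.
    return [pools[tag] for tag in sorted(pools)] + singles


tweets = [
    "Go Aussies #ashes #cricket",
    "What a collapse #Ashes",
    "Coffee time",
]
for doc in pool_by_hashtag(tweets):
    print(doc)
```

Because a tweet is copied into each of its hashtag pools, the number of documents after pooling is not simply the number of tweets minus the number of hashtagged tweets.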

Corpus filtering (Thursday 6 August 2015)

• Raw tweets: 963,064
• After data cleaning: 583,528
• After hashtag pooling: 516,263
  – 23% of tweets had hashtags
• Dictionary pruning – remove most frequent and least frequent terms
  – no_above=0.5 (fraction of documents), no_below=5 (documents)
  – 223,157 unique tokens reduced to 49,964 unique tokens
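The effect of the two pruning parameters can be illustrated with a small standard-library sketch. The parameter names mirror gensim's `Dictionary.filter_extremes(no_below=..., no_above=...)`, but the `prune_vocabulary` helper below is a hypothetical re-implementation for illustration and may differ from gensim in rounding details.

```python
from collections import Counter


def prune_vocabulary(documents, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in no more
    than `no_above` (a fraction between 0 and 1) of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))    # count each token once per document
    max_docs = no_above * len(documents)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    pruned = [[t for t in tokens if t in keep] for tokens in documents]
    return pruned, keep


documents = [
    ["ashes", "cricket", "the"],
    ["ashes", "the"],
    ["coal", "the"],
    ["coal", "china", "the"],
]
# Toy thresholds for a 4-document corpus; the slides used no_below=5, no_above=0.5.
pruned, vocabulary = prune_vocabulary(documents, no_below=2, no_above=0.5)
print(sorted(vocabulary))
```

On the figures reported above, pruning kept 49,964 of 223,157 unique tokens, or roughly 22% of the vocabulary.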

Latent Dirichlet Allocation (LDA)

• Gensim LDA (Lau & Baldwin, 2014)
• LdaMulticore
• Identify 30 topics
• 100 passes

Thursday 6th August 2015 – overall terms

https://github.com/bmabey/pyLDAvis

Thursday 6th August 2015 – Topic 2: Politics / coal / China / Queensland

Thursday 6th August 2015 – Topic 5: Cricket – The Ashes

Thursday 6th August 2015 – 30 topics, with hashtag pooling

Thursday 6th August 2015 – 30 topics, with hashtag pooling. Comparison to other study

Teen culture?

1.1m tweets from 147k, to 224k accounts
294k nodes total, including non-Australians
535k edges from 856k @mentions / RTs

Visualisation: Gephi, Force Atlas 2
Colours: Gephi, modularity resolution 1.0

Labels assigned through qualitative evaluation

Politics

Cricket

Teen Culture

Pop

From “One Day in the Life of a National Twittersphere” by Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016.

Further Outlook

• Confirm initial topic labelling by looking at top tweets for each topic
• Check whether the hashtag pooling has allowed non-hashtag tweet topics to still be visible
• Use statistical coherence of model (U_Mass coherence, C_V coherence) to tune LDA parameters
• Model different numbers of topics (coarse/fine grain)
• Relate topics per user back to our mention network graphs
• Extend to the full week (or longer)
• Compare to alternative approaches
  – Doc2Vec / Tensorflow / dynamic LDA etc
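As an illustration of the coherence-based tuning mentioned above, the sketch below computes one common formulation of U_Mass coherence for a topic's top words: the sum over ordered word pairs of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents containing the given words. The `umass_coherence` helper is hypothetical and returns an unnormalised sum; library implementations such as gensim's `CoherenceModel` may average or smooth differently.

```python
import math


def umass_coherence(top_words, documents):
    """Unnormalised U_Mass coherence for a list of a topic's top words,
    ordered from most to least probable."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_count(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            co_occurrences = doc_count(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / denom)
    return score


documents = [
    ["ashes", "cricket", "wicket"],
    ["cricket", "ashes", "broad"],
    ["coal", "china", "queensland"],
    ["politics", "coal", "queensland"],
]
print(umass_coherence(["ashes", "cricket"], documents))  # words that co-occur
print(umass_coherence(["ashes", "coal"], documents))     # words that never co-occur
```

Scores closer to zero indicate top words that tend to appear in the same documents, so comparing scores across models with different numbers of topics is one way to choose between coarse and fine-grained settings.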

References

• Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM, 1–16. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

• Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892. http://doi.org/10.1145/2484028.2484166

• Lau, J. H., & Baldwin, T. (2014). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation.

• Puschmann, C., & Scheffler, T. (2016). Topic modeling for media and communication research: A short primer (HIIG Discussion Paper Series No. 2016–5). Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2836478

• Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3110
