topic modeling of twitter followers - paris machine learning meetup - alex perrier

TOPIC MODELING APPLIED TO TWITTER TIMELINES

Alexis Perrier (@alexip, https://twitter.com/alexip)

Data & Software, Berklee College of Music, Boston (@BerkleeOnline, https://twitter.com/berkleeonline)

Data Science contributor (@ODSC, https://twitter.com/odsc)

Part I: Topic Modeling

- Nature and applications
- Algorithms and libraries

Part II: Project: Twitter followers

- Methods
- Problems
- Visualization

Sorry about the accents and anglicisms

A quick, high-level view of a large set of documents

Unsupervised technique

- 1 document ⇔ several topics
- 1 topic ⇔ a set of words
- The proportion of topics varies from one document to the next

SEMANTIC ANALYSIS OF DOCUMENT COLLECTIONS

Various corpora

- Literature
- Newspapers
- Official documents
- Online content
- Social networks, forums, ...

Coupled with external variables

- Evolution over time
- Authors, speakers

ALGORITHMS

MAIN ALGORITHMS

Vector-space approach

- Latent Semantic Analysis (LSA)

Probabilistic, Bayesian approach

- Latent Dirichlet Allocation (LDA)
- Structural Topic Modeling (STM), pLSA, hLDA, ...

Neural network approach

- convnets, ...

LATENT SEMANTIC ANALYSIS - LSA

- TF-IDF: relative word frequency => vectorization
- Document / word-frequency matrix
- Dimensionality reduction
- Singular Value Decomposition (SVD)

aka Latent Semantic Indexing (LSI)
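The LSA steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the documents, the helper names (tfidf_matrix, first_right_singular_vector) and the use of power iteration to approximate the first singular vector are all made up for this example; a real project would rely on a library SVD (Gensim, scikit-learn, ...).

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): one token list per document.
docs = [
    ["python", "code", "data"],
    ["python", "code", "software"],
    ["love", "heart", "peace"],
    ["love", "heart", "life"],
    ["love", "heart", "beauty"],
]
vocab = sorted({word for doc in docs for word in doc})


def tfidf_matrix(docs, vocab):
    """Build the documents x terms matrix of TF-IDF weights."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([
            (counts[word] / len(doc)) * math.log(len(docs) / doc_freq[word])
            for word in vocab
        ])
    return matrix


def first_right_singular_vector(matrix, n_iter=100):
    """Approximate the leading right singular vector by power iteration."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0] * n_terms
    for _ in range(n_iter):
        # u = A v (documents), then v = A^T u (terms), then normalize.
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        v = [sum(matrix[i][j] * u[i] for i in range(n_docs))
             for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


matrix = tfidf_matrix(docs, vocab)
v = first_right_singular_vector(matrix)

# Terms with the largest weights describe the first latent dimension.
top_terms = sorted(vocab, key=lambda w: -v[vocab.index(w)])[:4]
# Projection of each document onto that dimension.
doc_scores = [sum(a * x for a, x in zip(row, v)) for row in matrix]
print(top_terms)
print([round(score, 3) for score in doc_scores])
```

On this toy corpus the first latent dimension groups the four programming terms, and only the two programming documents project onto it. A full LSA would keep several such dimensions instead of one.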

LATENT DIRICHLET ALLOCATION

A topic is a probability distribution over the words of a given vocabulary.

LDA: the distribution of topics follows a Dirichlet distribution.

- K: number of topics
- α: controls the number of topics per document
- β: controls the number of words per topic

Details: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Bayesian inference, Gibbs sampling, Chinese restaurant process
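To make the roles of K, α and β concrete, here is a minimal sketch of the generative story LDA assumes, run forward on a tiny vocabulary. It only generates synthetic documents; it does not perform the inference (Gibbs sampling, ...) that a library carries out. The function names, the vocabulary and the parameter values are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet draw: normalize independent Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_length, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    topics = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    mixtures, corpus = [], []
    for _ in range(n_docs):
        # Topic proportions of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a topic for this word, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[z])[0])
        mixtures.append(theta)
        corpus.append(words)
    return topics, mixtures, corpus


# Hypothetical vocabulary and settings.
vocab = ["app", "team", "python", "docker", "love", "peace"]
topics, mixtures, corpus = generate_corpus(
    vocab, n_topics=3, n_docs=4, doc_length=8, alpha=0.5, beta=0.5
)
for theta, words in zip(mixtures, corpus):
    print([round(p, 2) for p in theta], words)
```

Small values of alpha tend to concentrate each document on few topics, and small values of beta tend to concentrate each topic on few words; larger values spread the mass more evenly.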

LIBRARIES

Python libraries

- Gensim (Topic Modelling for Humans): https://radimrehurek.com/gensim/
- LDA Python library

R packages

a. lsa package
b. lda package
c. topicmodels package
d. stm package: http://structuraltopicmodel.com/

Java libraries: S-Space Package, MALLET

C/C++ libraries: lda-c, hlda, ctm-c, hdp

THE PROJECT

3 articles

- Topic Modeling of twitter followers: http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html
- Segmentation of Twitter Timelines via Topic Modeling: http://alexperrier.github.io/jekyll/update/2015/09/16/segmentation_twitter_timelines_lda_vs_lsa.html
- NLP Analysis of the 2015 presidential candidate debates: http://alexperrier.github.io/jekyll/update/2015/11/12/nlp-analysis-presidential-debates.html

STEPS:

1. Build the corpus
2. Apply the models
3. Interpret => Perplexity!
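Beyond the pun, perplexity is also a standard score for topic models: the exponential of minus the average per-word log-likelihood, lower being better. The sketch below computes it for a hand-written two-topic model; the function name, the model values and the documents are assumptions made up for illustration, not results from the project.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """exp(-average log-likelihood per word) under a fitted topic model."""
    log_likelihood, n_words = 0.0, 0
    for doc, theta in zip(docs, doc_topic):
        for word in doc:
            # p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)
            p = sum(weight * phi[word] for weight, phi in zip(theta, topic_word))
            log_likelihood += math.log(p)
            n_words += 1
    return math.exp(-log_likelihood / n_words)


# Hypothetical model: two topics over a four-word vocabulary.
topic_word = [
    {"python": 0.5, "code": 0.5, "love": 0.0, "peace": 0.0},
    {"python": 0.0, "code": 0.0, "love": 0.5, "peace": 0.5},
]
docs = [["python", "code", "python"], ["love", "peace"]]
doc_topic = [[1.0, 0.0], [0.0, 1.0]]

score = perplexity(docs, doc_topic, topic_word)
print(score)
```

Here every word gets probability 0.5, so the perplexity is 2 up to floating-point rounding; a model spreading its mass uniformly over the four words would score 4.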

BUILD THE CORPUS

1. Get the timelines of the 700 followers of @alexip (https://twitter.com/alexip) with Twython (https://twython.readthedocs.org/en/latest/)

One document corresponds to one timeline

2. Vectorize the document

- bag-of-words
- English timelines only: lang = 'en' + langid (https://github.com/saffsd/langid.py)
- NLTK (http://www.nltk.org/): tokenize, stopwords, stemming, POS

3. TF-IDF

- Create a dictionary of words
- Vectorize the documents with TF-IDF
- Gensim, NLTK, Scikit, ...
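The vectorization step can be sketched with the standard library alone. This is a stand-in, not the code used in the project (which relied on NLTK and Gensim): the stopword list, the helper names and the two sample timelines are assumptions, and stemming, POS tagging and language detection are left out.

```python
import re
from collections import Counter

# Tiny hand-made stopword list (assumption; NLTK ships a fuller one).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for", "my", "on", "rt"}


def tokenize(text):
    """Lowercase, drop URLs and @mentions, keep alphabetic non-stopword tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]


def build_dictionary(token_lists):
    """Map every distinct token to an integer id."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: idx for idx, token in enumerate(vocab)}


def bag_of_words(tokens, dictionary):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    return sorted(counts.items())


# One document = one timeline (all tweets of a follower joined together).
timelines = {
    "user_a": "Learning #python for data science https://example.com @someone python rocks",
    "user_b": "The startup team is looking for an app idea",
}
tokens = {user: tokenize(text) for user, text in timelines.items()}
dictionary = build_dictionary(tokens.values())
corpus = {user: bag_of_words(toks, dictionary) for user, toks in tokens.items()}

for user, bow in corpus.items():
    print(user, bow)
```

The resulting (token_id, count) pairs are the raw counts that a TF-IDF weighting would then rescale.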

1) APPLY LSA

Results are hard to interpret, to say the least

2) APPLY LDA

Clearly better

u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',
u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',
u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',
u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',
u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',
u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',
u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',
u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',
u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',
u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider'
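Topic strings in this "weight*word + weight*word" form are easier to inspect once split into pairs. A minimal sketch, assuming only the string format shown above; parse_topic is a hypothetical helper, and the two sample strings are shortened copies of the first and fifth topics.

```python
import re


def parse_topic(topic_string):
    """Split a 'weight*word + weight*word' string into (weight, word) pairs."""
    pairs = re.findall(r"([0-9.]+)\*(\w+)", topic_string)
    return [(float(weight), word) for weight, word in pairs]


# Shortened copies of two topics from the output above.
topics = [
    u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote',
    u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement',
]

for topic_id, topic in enumerate(topics):
    # Show the three heaviest words of each topic as a rough label.
    words = [word for _, word in parse_topic(topic)[:3]]
    print(topic_id, ", ".join(words))
```

Listing the few heaviest words per topic is a quick way to attempt naming the topics by hand.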

What are the topics?

How many topics?

BACK TO THE CORPUS

- Clean the documents
- Extend the stopword list by hand
- Identify anomalies: robots, retweets, hashtags, ...
- Keep only the timelines that have tweeted recently.

245 timelines
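Part of this cleaning can be sketched as a simple filter. Everything specific here is an assumption for illustration: the tweet dictionaries with "text" and "created_at" keys, the "RT @" prefix test for retweets, the 30-day recency threshold and the sample data are not taken from the project.

```python
from datetime import datetime, timedelta


def clean_timeline(tweets, extra_stopwords):
    """Drop retweets, then strip hand-picked stopwords from the remaining tweets."""
    kept = []
    for tweet in tweets:
        if tweet["text"].startswith("RT @"):
            continue
        words = [w for w in tweet["text"].split()
                 if w.lower() not in extra_stopwords]
        kept.append({"text": " ".join(words), "created_at": tweet["created_at"]})
    return kept


def is_recent(tweets, now, max_age_days=30):
    """True if the most recent remaining tweet is at most max_age_days old."""
    if not tweets:
        return False
    latest = max(tweet["created_at"] for tweet in tweets)
    return now - latest <= timedelta(days=max_age_days)


# Hypothetical data: one active follower, one dormant follower.
now = datetime(2015, 9, 1)
EXTRA_STOPWORDS = {"thx"}
timelines = {
    "active": [
        {"text": "RT @bot win a bonus", "created_at": datetime(2015, 8, 30)},
        {"text": "thx for the python talk", "created_at": datetime(2015, 8, 28)},
    ],
    "dormant": [
        {"text": "hello world", "created_at": datetime(2014, 1, 1)},
    ],
}

cleaned = {user: clean_timeline(tweets, EXTRA_STOPWORDS)
           for user, tweets in timelines.items()}
recent = {user: tweets for user, tweets in cleaned.items()
          if is_recent(tweets, now)}
print(sorted(recent))
```

Detecting robots usually needs more than a prefix test, so this only covers the retweet, stopword and recency parts of the list.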

Visualization - LDAvis

http://nbviewer.jupyter.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis_V2.ipynb#topic=4&lambda=0.57&term=

3) STRUCTURAL TOPIC MODELING

- NLP: tokenization, stemming, stop-words, ...
- Naming the topics: several groups of words per topic (exclusivity, frequency)
- Optimal number of topics: grid search + scoring
- Influence of external variables

STM: PRESIDENTIAL DEBATES

- US primaries
- 6 debates: 2 Democratic, 4 Republican
- 1 document = one speaker during one debate

Visualization - stmBrowser

http://alexperrier.github.io/stm-visualization/index.html

THANK YOU

@alexip: http://twitter.com/alexip

Slides: http://alexperrier.github.io/

Code & Data & Viz:

- https://github.com/alexperrier/datatalks/tree/master/twitter
- https://github.com/alexperrier/datatalks/tree/master/debates
- http://nbviewer.jupyter.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis_V2.ipynb
- http://alexperrier.github.io/stm-visualization/index.html

Ref:

- topic modeling: http://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- lda: http://ai.stanford.edu/~ang/papers/nips01-lda.pdf
- pyLDAvis: https://github.com/bmabey/pyLDAvis
- stm: http://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf
- stm R: http://structuraltopicmodel.com/
- stmBrowser: https://github.com/mroberts/stmBrowser
