an introduction to gensim: "topic modelling for humans"


Page 1: An Introduction to gensim: "Topic Modelling for Humans"

gensim: topic modeling for humans

William Bert
DC Python Meetup
1 May 2012

Page 2

please go to http://ADDRESS and enter a sentence

interesting relationships?

gensim generated the data for those visualizations by computing the semantic similarity of the input sentences

Page 3

who am I?

William Bert
• developer at Carney Labs (teamcarney.com)
• user of gensim
• still new to the world of topic modelling, semantic similarity, etc.

Page 4

gensim: “topic modeling for humans”

topic modeling attempts to uncover the underlying semantic structure of a set of documents by identifying recurring patterns of terms (topics).

topic modelling does not parse sentences, does not care about word order, and does not "understand" grammar or syntax.

Page 5

gensim: “topic modeling for humans”

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]

Page 6

gensim isn't about topic modeling (for me, anyway). It's about similarity.

What is similarity? Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)

Page 7

Is

"A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend."

similar in meaning to

"For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft."

more or less than it is similar to

"One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say."

?

Page 8

Who cares about semantic similarity?

Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction

Page 9

gensim.corpora

corpus = a stream of vectors of document feature ids. for example, the words in documents are features ("bag of words")

TextCorpus and other kinds of corpus classes:
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]

Page 10

Dictionary class

dictionary maps features (words) to feature ids (numbers)

>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]

>>> print corpus.dictionary
Dictionary(8472 unique tokens)
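The feature-id mapping that Dictionary and doc2bow perform can be sketched in plain Python. This is a toy illustration of the concept, not gensim's actual implementation (the function names below are made up for the sketch):

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each unique token an integer feature id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Convert a tokenized document into a sparse (feature_id, count) vector."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [["human", "interface", "computer"],
        ["survey", "computer", "system", "computer"]]
token2id = build_dictionary(docs)
print(doc2bow(docs[1], token2id))  # [(2, 2), (3, 1), (4, 1)]
```

Each document becomes a list of (feature id, count) pairs, which is exactly the shape of the corpus output shown above.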

gensim.corpora

Page 11

need a massive collection of documents that ostensibly has meaning. sounds like a job for wikipedia.

>>> wiki_corpus = WikiCorpus(articles) # articles is Wikipedia text dump bz2 file. several hours.

>>> wiki_corpus.dictionary.save("wiki_dict.dict") # persist dictionary

>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # uses numpy to persist corpus in Matrix Market format. several GBs. can be BZ2’ed.

>>> wiki_corpus = MmCorpus("wiki_corpus.mm") # revive a corpus

Page 12

gensim.models

transform corpora using models classes. for example, the term frequency/inverse document frequency (TFIDF) transformation reflects the importance of a term, not just its presence/absence.
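The TFIDF idea can be sketched in a few lines of plain Python. This toy version weights each count by log2(N/df), where df is the number of documents containing the feature; gensim's TfidfModel is built on the same idea but offers configurable weighting and normalization:

```python
import math

def tfidf(bow_corpus):
    """Re-weight (feature_id, count) vectors so that terms frequent in a document
    but rare across the corpus score highest."""
    n_docs = len(bow_corpus)
    df = {}  # document frequency: how many docs contain each feature id
    for doc in bow_corpus:
        for fid, _ in doc:
            df[fid] = df.get(fid, 0) + 1
    return [[(fid, count * math.log(n_docs / df[fid], 2)) for fid, count in doc]
            for doc in bow_corpus]

corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]
weighted = tfidf(corpus)
print(weighted)  # [[(0, 1.0), (1, 0.0)], [(1, 0.0), (2, 3.0)]]
```

Feature 1 appears in every document, so its idf (and therefore its weight) drops to zero: presence everywhere carries no information.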

Page 13

>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary) # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)

gensim.models

>>> tfidf_trans[documents] # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]

>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary) # builds a new corpus by iterating over documents transformed to TFIDF

Page 14

>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400) # creates LSI transformation model from tfidf corpus representation

gensim.models

Page 15

topics again for a bit

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]

Page 16

topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full-rank SVD would be able to recreate the underlying matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error) approximation of the matrix
• this approximation can find interesting relationships among data
• it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
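The dominant-direction idea behind SVD can be illustrated without numpy: power iteration on AᵀA converges to the top right singular vector of a tiny term-document matrix. This is a pure-Python sketch of the concept; gensim's LsiModel uses a streamed, far more sophisticated truncated SVD:

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_vector(A, iters=100):
    """Power iteration on A^T A yields the dominant right singular vector of A."""
    v = [1.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        n = sum(x * x for x in w) ** 0.5
        v = [x / n for x in w]
    return v

# toy term-document matrix: rows = terms, columns = documents
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
v = top_singular_vector(A)
# the dominant direction merges the two correlated documents (columns 0 and 1)
# and ignores the unrelated third one: v ≈ [0.707, 0.707, 0.0]
```

Keeping only the first few such directions is exactly the "lower-rank approximation" described above: correlated dimensions get merged, and low-weight noise directions get dropped.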

Page 17

topics again for a bit
• SVD tutorial: alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
• original paper: www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf
• general explanation: tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• many more

Page 18

>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400, decay=1.0, chunksize=20000) # creates LSI transformation model from tfidf corpus representation

>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)

gensim.models

Page 19

gensim.similarities (the best part)

>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]] # similarity of each document in the index corpus to a new query document
>>> [s for s in index] # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]), array([ 0.  ,  1.  ,  0.02, -0.02]), array([ 0.08,  0.02,  1.  ,  0.15]), array([ 0.01, -0.02,  0.15,  1.  ])]
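Under the hood, the similarity scores above are cosine similarities between vectors in the reduced topic space. A minimal sketch, using made-up 4-dimensional stand-ins for LSI vectors:

```python
def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# pretend these are LSI vectors for three indexed documents and one query
index = [[0.9, 0.1, 0.0, 0.2],
         [0.0, 0.8, 0.3, 0.1],
         [0.85, 0.05, 0.1, 0.25]]
query = [0.88, 0.08, 0.05, 0.22]
sims = [cosine(doc, query) for doc in index]
# docs 0 and 2 point in nearly the same direction as the query (similarity near 1);
# doc 1 points elsewhere (similarity near 0.1)
```

The Similarity class does the same comparison against every indexed document at once, sharded to disk so the index can be far larger than memory.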

Page 20

about gensim

created by Radim Rehurek
• radimrehurek.com/gensim
• github.com/piskvorky/gensim
• groups.google.com/group/gensim

four additional models available

dependencies: numpy, scipy

optional: Pyro, Pattern

Page 21

example code, visualization code, and ppt: github.com/sandinmyjoints

interview with Radim: williamjohnbert.com

thank you

Page 22

(additional slides)

Page 23

• term frequency/inverse document frequency (TFIDF)
• log entropy
• random projections
• latent dirichlet allocation (LDA)
• hierarchical dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)

gensim.models

Page 24

Dependencies: numpy and scipy, and optionally Pyro for distributed and Pattern for lemmatization

data from Lee 2005 and other papers is available in gensim for tests

slightly more about gensim

Page 25

>>> lda_model.show_topics()
['0.083*bridge + 0.034*dam + 0.034*river + 0.027*canal + 0.026*construction + 0.014*ferry + 0.013*bridges + 0.013*tunnel + 0.012*trail + 0.012*reservoir',
 '0.044*fight + 0.029*bout + 0.029*via + 0.028*martial + 0.025*boxing + 0.024*submission + 0.021*loss + 0.021*mixed + 0.020*arts + 0.020*fighting',
 '0.086*italian + 0.062*italy + 0.048*di + 0.024*milan + 0.019*rome + 0.014*venice + 0.013*giovanni + 0.012*della + 0.011*florence + 0.011*francesco']

gensim: “topic modelling for humans”