an introduction to gensim: "topic modelling for humans"


Page 1: An Introduction to gensim: "Topic Modelling for Humans"

gensim: topic modeling for humans

William Bert
DC Python Meetup
1 May 2012

Page 2

please go to http://ADDRESS and enter a sentence

interesting relationships?

gensim generated the data for those visualizations by computing the semantic similarity of the input sentences

Page 3

who am I?

William Bert
• developer at Carney Labs (teamcarney.com)
• user of gensim
• still new to the world of topic modelling, semantic similarity, etc.

Page 4

gensim: “topic modeling for humans”

topic modeling attempts to uncover the underlying semantic structure of a set of documents by identifying recurring patterns of terms (topics).

topic modelling does not parse sentences, does not care about word order, and does not "understand" grammar or syntax.

Page 5

gensim: “topic modeling for humans”

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]

Page 6

gensim isn't about topic modeling (for me, anyway). It's about similarity.

What is similarity? Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)

Page 7

Is

"A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend."

similar in meaning to

"For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft."

more or less than it is similar to

"One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say."

?

Page 8

Who cares about semantic similarity?

Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction

Page 9

gensim.corpora

corpus = a stream of vectors of document feature ids. for example, the words in documents are features ("bag of words")

TextCorpus and other kinds of corpus classes:
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]

Page 10

Dictionary class

dictionary maps features (words) to feature ids (numbers)

>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]

>>> print corpus.dictionary
Dictionary(8472 unique tokens)
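The feature-id mapping that Dictionary and doc2bow perform can be sketched in plain Python. This is a toy illustration of the concept, not gensim's actual implementation (the function names below are made up for the sketch):

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each unique token an integer feature id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Convert a tokenized document into a sparse (feature_id, count) vector."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [["human", "interface", "computer"],
        ["survey", "computer", "system", "computer"]]
token2id = build_dictionary(docs)
print(doc2bow(docs[1], token2id))  # [(2, 2), (3, 1), (4, 1)]
```

Each document becomes a list of (feature id, count) pairs, which is exactly the shape of the corpus output shown above.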

gensim.corpora

Page 11

need a massive collection of documents that ostensibly has meaning. sounds like a job for wikipedia.

>>> wiki_corpus = WikiCorpus(articles) # articles is Wikipedia text dump bz2 file. several hours.

>>> wiki_corpus.dictionary.save("wiki_dict.dict") # persist dictionary

>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # uses numpy to persist corpus in Matrix Market format. several GBs. can be BZ2’ed.

>>> wiki_corpus = MmCorpus("wiki_corpus.mm") # revive a corpus

Page 12

gensim.models

transform corpora using models classes. for example, the term frequency/inverse document frequency (TFIDF) transformation reflects the importance of a term, not just its presence/absence.
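The TFIDF idea can be sketched in a few lines of plain Python. This toy version weights each count by log2(N/df), where df is the number of documents containing the feature; gensim's TfidfModel is built on the same idea but offers configurable weighting and normalization:

```python
import math

def tfidf(bow_corpus):
    """Re-weight (feature_id, count) vectors so that terms frequent in a document
    but rare across the corpus score highest."""
    n_docs = len(bow_corpus)
    df = {}  # document frequency: how many docs contain each feature id
    for doc in bow_corpus:
        for fid, _ in doc:
            df[fid] = df.get(fid, 0) + 1
    return [[(fid, count * math.log(n_docs / df[fid], 2)) for fid, count in doc]
            for doc in bow_corpus]

corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]
weighted = tfidf(corpus)
print(weighted)  # [[(0, 1.0), (1, 0.0)], [(1, 0.0), (2, 3.0)]]
```

Feature 1 appears in every document, so its idf (and therefore its weight) drops to zero: presence everywhere carries no information.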

Page 13

>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary) # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)

gensim.models

>>> tfidf_trans[documents] # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]

>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary) # builds a new corpus by iterating over documents transformed to TFIDF

Page 14

>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400) # creates LSI transformation model from tfidf corpus representation

gensim.models

Page 15

topics again for a bit

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]

Page 16

topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full-rank SVD would be able to recreate the underlying matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error) approximation of the matrix
• this approximation can find interesting relationships among data
• it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
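The dominant-direction idea behind SVD can be illustrated without numpy: power iteration on AᵀA converges to the top right singular vector of a tiny term-document matrix. This is a pure-Python sketch of the concept; gensim's LsiModel uses a streamed, far more sophisticated truncated SVD:

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_vector(A, iters=100):
    """Power iteration on A^T A yields the dominant right singular vector of A."""
    v = [1.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        n = sum(x * x for x in w) ** 0.5
        v = [x / n for x in w]
    return v

# toy term-document matrix: rows = terms, columns = documents
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
v = top_singular_vector(A)
# the dominant direction merges the two correlated documents (columns 0 and 1)
# and ignores the unrelated third one: v ≈ [0.707, 0.707, 0.0]
```

Keeping only the first few such directions is exactly the "lower-rank approximation" described above: correlated dimensions get merged, and low-weight noise directions get dropped.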

Page 17

topics again for a bit
• SVD tutorial: alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
• original paper: www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf
• general explanation: tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• many more

Page 18

>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400, decay=1.0, chunksize=20000) # creates LSI transformation model from tfidf corpus representation

>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)

gensim.models

Page 19

gensim.similarities (the best part)

>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]] # similarity of each document in the index corpus to a new query document
>>> [s for s in index] # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]), array([ 0.  ,  1.  ,  0.02, -0.02]), array([ 0.08,  0.02,  1.  ,  0.15]), array([ 0.01, -0.02,  0.15,  1.  ])]
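Under the hood, the similarity scores above are cosine similarities between vectors in the reduced topic space. A minimal sketch, using made-up 4-dimensional stand-ins for LSI vectors:

```python
def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# pretend these are LSI vectors for three indexed documents and one query
index = [[0.9, 0.1, 0.0, 0.2],
         [0.0, 0.8, 0.3, 0.1],
         [0.85, 0.05, 0.1, 0.25]]
query = [0.88, 0.08, 0.05, 0.22]
sims = [cosine(doc, query) for doc in index]
# docs 0 and 2 point in nearly the same direction as the query (similarity near 1);
# doc 1 points elsewhere (similarity near 0.1)
```

The Similarity class does the same comparison against every indexed document at once, sharded to disk so the index can be far larger than memory.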

Page 20

about gensim

created by Radim Rehurek
• radimrehurek.com/gensim
• github.com/piskvorky/gensim
• groups.google.com/group/gensim

four additional models available

dependencies: numpy, scipy

optional: Pyro, Pattern

Page 21

example code, visualization code, and ppt: github.com/sandinmyjoints

interview with Radim: williamjohnbert.com

thank you

Page 22

(additional slides)

Page 23

• term frequency/inverse document frequency (TFIDF)
• log entropy
• random projections
• latent dirichlet allocation (LDA)
• hierarchical dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)

gensim.models

Page 24

Dependencies: numpy and scipy, and optionally Pyro for distributed and Pattern for lemmatization

data from Lee 2005 and other papers is available in gensim for tests

slightly more about gensim

Page 25

>>> lda_model.show_topics()
['0.083*bridge + 0.034*dam + 0.034*river + 0.027*canal + 0.026*construction + 0.014*ferry + 0.013*bridges + 0.013*tunnel + 0.012*trail + 0.012*reservoir',
 '0.044*fight + 0.029*bout + 0.029*via + 0.028*martial + 0.025*boxing + 0.024*submission + 0.021*loss + 0.021*mixed + 0.020*arts + 0.020*fighting',
 '0.086*italian + 0.062*italy + 0.048*di + 0.024*milan + 0.019*rome + 0.014*venice + 0.013*giovanni + 0.012*della + 0.011*florence + 0.011*francesco']

gensim: “topic modelling for humans”