Word embeddings as a service - PyData NYC 2015
Word embeddings as a service
François Scharffe, @lechatpito (http://www.twitter.com/lechatpito)
3Top (http://www.3top.com)
PyData NYC 2015
Outline of the talk
What is 3Top?
What are word embeddings?
How to implement a simple recommendation system for 3Top categories?
Rank Anything, Rank Everything
3Top is a ranking and recommendation platform
Rankings convey more information than star ratings
Who cares about 3 stars or less? I just want the best stuff
I'd rather trust my friends than read through reviews
If I have more than 3 items to rank, I can probably use a more precise category
Not yet launched, but the site is up
Let's take a look at http://www.3top.com
Places
http://www.3top.com/category/1138/gyms-for-students-near-lower-east-side
Movies
http://www.3top.com/category/142/movies-about-wall-street
Anything really
http://www.3top.com/category/765/foods-named-after-people
Data & knowledge engineering at 3Top
Building a solid data engineering architecture before launching the site.
Natural language processing pipeline
Parsing categories
Currently using a parser we developed
About to switch to spaCy (http://spacy.io) from Matthew Honnibal. It's great, check it out!
Detecting named entities, locations
A large knowledge graph backed by an ontology
An itemization pipeline: matching free-text items to entities in the knowledge graph
Category recommendation
How are we going to build a simple recommendation system without having any significant number of users, categories, or rankings?
Note the impressive figures:
Number of Users: 316
Number of Rankings: 2123
Number of Categories: 1316
Feel free to add a few rankings:
Wow! ;)
http://www.3top.com
Word embeddings?
Who hasn't heard about word2vec?
Word embeddings represent words in a high-dimensional space in such a way that words appearing in similar contexts are close to each other in that space.
The dimensionality of the space is not that high, typically a few hundred dimensions.
Word embeddings are a language modeling method, more precisely a distributed vector representation of words.
Compared to bag-of-words:
Dimensionality is low and constant with respect to the vocabulary size
Depending on the training algorithm, partially trained models give partially good results
Compared to topic modeling:
Better granularity, the base element is a word
Phrase vectors can also be learned
What are word embedding models good at?
Modeling similarity between words:
sim(tomato, beefsteak) < sim(apple, tomato) < sim(pear, apple)
Algebraic operations on word vectors:
v(Paris) - v(France) ~= v(Berlin) - v(Germany)
Examples
The examples here use a GloVe model (100 dimensions, 400k vocabulary, trained on Wikipedia and Gigaword (news articles)).
In [3]: from gensim.models import Word2Vec
        model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")
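One practical note (an assumption about preprocessing, not shown in the talk): raw GloVe text files lack the "<vocab_size> <dimensions>" header line that gensim's word2vec text format expects, so a small conversion step is usually needed first. A minimal sketch:

    # Assumed preprocessing: prepend the header line gensim's load_word2vec_format() expects.
    def add_word2vec_header(glove_path, out_path):
        with open(glove_path) as f:
            lines = f.readlines()
        n_words = len(lines)
        n_dims = len(lines[0].split()) - 1  # first field on each line is the word itself
        with open(out_path, "w") as f:
            f.write("%d %d\n" % (n_words, n_dims))
            f.writelines(lines)

    # add_word2vec_header("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")

Newer gensim versions ship a gensim.scripts.glove2word2vec helper that performs the same conversion.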
In [4]: model.most_similar("python", topn=10)
Out[4]: [(u'monty', 0.6886237859725952), (u'php', 0.586538553237915), (u'perl', 0.5784406661987305), (u'cleese', 0.5446674823760986), (u'flipper', 0.5112984776496887), (u'ruby', 0.5066927671432495), (u'spamalot', 0.505638837814331), (u'javascript', 0.5030568838119507), (u'reticulated', 0.4983375668525696), (u'monkey', 0.49764129519462585)]
In [5]: model.most_similar_cosmul(positive=["python", "programming"], topn=5)
Out[5]: [(u'perl', 0.5658619999885559), (u'scripting', 0.559501588344574), (u'scripts', 0.5469149351119995), (u'php', 0.5461974740028381), (u'language', 0.5350533127784729)]
In [6]: model.most_similar_cosmul(positive=["python", "venomous"], topn=5)
Out[6]: [(u'scorpion', 0.5413044095039368), (u'snakes', 0.5263831615447998), (u'snake', 0.5222328901290894), (u'spider', 0.5214570164680481), (u'marsupial', 0.517005205154419)]
The classical example:
v(king) - v(man) + v(woman) -> v(queen)
In [7]: model.most_similar_cosmul(positive=["king", "woman"], negative=["man"])
Out[7]: [(u'queen', 0.8964556455612183), (u'monarch', 0.8495977520942688), (u'throne', 0.8447030782699585), (u'princess', 0.8371668457984924), (u'elizabeth', 0.835679292678833), (u'daughter', 0.8348594903945923), (u'prince', 0.8230059742927551), (u'mother', 0.8154449462890625), (u'margaret', 0.8147734999656677), (u'father', 0.8100854158401489)]
Training a model
Very easy once you have a clean corpus
Great tools in Python
Tutorial on training a model using Gensim: http://rare-technologies.com/word2vec-tutorial/
Radim Řehůřek gave a talk last year at PyData Berlin about optimizations in Cython: https://www.youtube.com/watch?v=vU4TlwZzTfU
For GloVe: https://github.com/maciejkula/glove-python
Gensim word2vec implementation specifics:
Training time ~ 8 hours on 8 cores/8 threads to learn 600 dimensions on a 1.9B-word corpus
Memory requirements depend on the vocabulary size and on the number of dimensions:
3 matrices * 4 bytes (float) * |dimensions| * |vocabulary|
The GloVe implementation in Python (https://github.com/maciejkula/glove-python/) takes half the time but has a quadratic memory size. Check its pull requests for memory optimizations.
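As a quick illustration of that memory formula (the 3-million-word vocabulary below is an assumed example figure, not one from the talk):

    # Back-of-the-envelope memory estimate for a gensim word2vec model:
    # 3 matrices * 4 bytes (float32) * |dimensions| * |vocabulary|
    def w2v_memory_gb(dimensions, vocabulary_size):
        return 3 * 4.0 * dimensions * vocabulary_size / 1e9

    print(w2v_memory_gb(600, 3 * 10**6))   # ~21.6 GB for 600 dimensions and a 3M-word vocabulary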
A good thing to know: a bigger training set does improve the quality of the model, even for specialized tasks.
As a consequence, you probably want to use a huge corpus. Good models are available.
Building your own model can be useful when you want to find out about the properties of your corpus, or when you want to compare different corpora, for example the evolution of language in a newspaper across different periods of time.
Finding a model
From https://github.com/3Top/word2vec-api/ (model file - number of dimensions):

Google News (GoogleNews-vectors-negative300.bin.gz) - 300
Freebase IDs (https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) - 1000
Freebase names (https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing) - 1000
Wikipedia+Gigaword 5 (http://nlp.stanford.edu/data/glove.6B.zip) - 50/100/200/300
Common Crawl 42B (http://nlp.stanford.edu/data/glove.42B.300d.zip) - 300
Common Crawl 840B (http://nlp.stanford.edu/data/glove.840B.300d.zip) - 300
Twitter (2B Tweets) (http://www-nlp.stanford.edu/data/glove.twitter.27B.zip) - 25/50/100/200
Wikipedia dependency (http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) - 300
DBPedia vectors (https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent) - 1000
Building a recommendation engine for 3Top categories
By combining word vectors, we build category vectors.
In [8]: def build_category_vector(category):
            pass  # Get the postags
            vector = []
            for tag in postags:
                if tag.tagged in ['NN', 'NNS', 'JJ', 'NNP', 'NNPS', 'NNDBN', 'VBG', 'CD']:  # Only keep meaningful words
                    try:
                        v = word2vec(tag.tagValue)  # Get the word vector
                        if v.any():
                            vector.append(v)
                    except:
                        logger.debug("Word not found in corpus: %s" % tag.tagValue)
                        tagset.add(tag.tagValue)
            if vector:
                return matutils.unitvec(np.array(vector).mean(axis=0))  # Average the vector
            else:
                return np.empty(300)
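The function above relies on 3Top-internal objects (the POS tagger, word2vec, logger, tagset). A minimal, self-contained sketch of the same idea, assuming only a loaded gensim word-vector model and naive whitespace tokenization (both assumptions, not the production code):

    import numpy as np
    from gensim import matutils

    def build_category_vector_sketch(model, category_text, dims=300):
        # Average the word vectors of a category's words into one unit vector.
        # `model` is assumed to be a gensim word-vector model supporting `in` and indexing.
        vectors = []
        for word in category_text.lower().split():
            if word in model:               # skip out-of-vocabulary words
                vectors.append(model[word])
        if not vectors:
            return np.zeros(dims)
        return matutils.unitvec(np.array(vectors).mean(axis=0))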
We store those vectors in a category space, and at page load time compute the most similar categories for a given category.
Now let us look at the similarity method.
sim(c1, c2) = v(c1) . v(c2)
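Since the category vectors are unit-normalized, this similarity is just a dot product (equivalently, cosine similarity). A small sketch of ranking a category space against a query vector, using the hypothetical build_category_vector_sketch helper above:

    import numpy as np

    def most_similar_categories_sketch(query_vec, category_vecs, names, n=5):
        # category_vecs: (num_categories, dims) array of unit vectors, names: parallel list of labels
        sims = category_vecs.dot(query_vec)     # dot product == cosine similarity for unit vectors
        best = np.argsort(-sims)[:n]
        return [(names[i], float(sims[i])) for i in best]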
In [21]: cs = CategorySimilarity()
         # print(Category.objects.all().count())
         category = Category.objects.get(category=u"Blue-collar beers that come in a can")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]
DEBUG Category space size (as found in the cache): 1125
Belgian Trappist Beers
Belgian Beer Cafe in NYC
Dark and Stormy Cocktail in NYC
Brands of Ginger Beer
Pink Drinks
In [10]: category = Category.objects.get(category=u"Italian Restaurants in NYC.")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]
Italian Restaurants in NY
Restaurants in Nyc
NYC Mexican Restaurants
Romanian Restaurants in NYC
Thai Restaurants in NYC
In [11]: category = Category.objects.get(category=u"Coen Brothers Movies")
         _ = [print(c) for c in cs.most_similar_categories(category, n=10)]
Quentin Tarantino Movies
Martin Scorsese Films.
Movies Starring Creepy Children
Tim Burton Movies
Movies Starring Sean Penn
Pixar Movies
Godfather Movies
Berlin Indie Movie Theaters
Kubrick Movies
Harry Potter Movies
Our recommendation system uses the Common Crawl 42B-word, 300-dimension model trained with GloVe.
It takes around 6GB in memory ... and this is a problem:
We run a Django server and 8 celery workers on an EC2 T2 Micro... That would be a lot ofmemory for that poor instance.
A word embedding service
We split the word embedding model out into a separate service
A simple Flask server with a few primitives:
curl http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java
curl http://127.0.0.1:5000/word2vec/n_similarity?ws1=Python&ws1=programming&ws2=Java
curl http://127.0.0.1:5000/word2vec/model?word=Python
curl http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=queen&negative=man
Easy to set up:
python word2vec-api --model path/to/the/model [--host host --port 1234]
Get it at https://github.com/3Top/word2vec-api
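For illustration, a minimal sketch of what such a service can look like. This is not the word2vec-api code itself, just an assumed stripped-down version exposing a similarity and a most_similar endpoint over a model loaded at startup:

    # Minimal sketch of a word-embedding service (assumed simplification of word2vec-api).
    from flask import Flask, request, jsonify
    from gensim.models import Word2Vec

    app = Flask(__name__)
    # Same loading call as in the talk; newer gensim versions use KeyedVectors.load_word2vec_format.
    model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")  # example path

    @app.route("/word2vec/similarity")
    def similarity():
        # e.g. /word2vec/similarity?w1=Python&w2=Java
        return jsonify(similarity=float(model.similarity(request.args["w1"], request.args["w2"])))

    @app.route("/word2vec/most_similar")
    def most_similar():
        # e.g. /word2vec/most_similar?positive=king&positive=woman&negative=man
        results = model.most_similar(positive=request.args.getlist("positive"),
                                     negative=request.args.getlist("negative"))
        return jsonify(results=results)

    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=5000)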
Caching the vector space
As the number of categories increases, we do not want to rebuild category vectors and hit the database every time a recommendation is needed (every page access). The category vectors' size is actually significant:
In [12]: print(np.fromstring(category.vector).nbytes)
2400
In [13]: print(u"... and for {} categories the space becomes large (~{}MB)".format(len(cs.category_space.syn0), cs.category_space.syn0.nbytes/1000000))
... and for 1125 categories the space becomes large (~2MB)
We store vectors with the category object in MySQL, using a base64 encoding of the numpy object. Let's look at it:
In [14]: print(category._vector[:1000] + "...")
oNl7j9l8hr/2FoHhEbSyv+GALkJ6JVU/rxm5pueouL80aF72QLiivzr0z1WKaZM/i3M5FvqQmD+hDmIs7fitv+lL6cLFSbM/lSwwkBxv0D/sA+FgTxmhP5+lJZPGVLE/q8pBj07ukT9OceKjxl2jv4s4cA7RJIc/JxVUF8afnr/RQcXciUyCP+M4N3mbtrS/Ngwo85uUor/+4vqargCyP7YnHbTSv3W/MHjrh6iHsr/1vkzSmI+yP+bsS7E0B5u/JJ5iJAUNoT//xo0IJ3inP5/BwCfgWZ2/Q2r8q9Fuir/KdAOAr0OQPwzGTnUXU5i/9uQD77+xuD/1QKbEaDWjvyqfSePd7XE/3RLqJXiOrz8ZyEDICd2UP2beFLiqPZy/rIb+8sFgqr+ILyc3/5yoP5pL25IahpQ/4WpgCeuNZ7/ley+Tl9ygP+knz2odUHo/yBSdc5+Klj+GLgrafftvP2yiB76KBY2/z0RqB+1ri7+THdXBVVKvPzwZ2X+2HaA/oOThsHeidL/O7w8+bummv8Z8XCqeYas/WzQpioG6qr+JaauGrie2P7+8NmNBN5o/0Ji6XFJMpj/xAtoHvg9PPyr3OOBXVaG/M2aCbN8dv79pANKgDzNrPy4XXBNVi4S/WBuYYjWZlD8T/W3jLbOJPy3xHNTzarQ/MoOWx7aZtz/RDMwbryievwA5kQgazaO/3Ep0jVo1rD+ns3oJ3iWUv4TlEPcAnJy/dHNcwygjnr/cMGYNKPbCP5E06afPWa6/mUHAC+8mjj+NwgyjQFB5v6ffLvduuai/kBntVvsdpb8Yg3KzY/qev9r5son3VJg/h06aD0/IuD8NMHm7jGViv7o8zQzPd5U/esP4Ax6BrL8TOZuX+qGpP1WHNPzdQH0/7HXRMAqXmr9G8pkwjbenv3RvQppal7i/E5jWmLXSp792VpPxJeOjPyEKhEhl324/1E00QnHdvr9Mg0Fohd+cP6UAj0X5R5g/2umwTF42...
In [15]: # a property method takes care of the decoding
         def get_vector(self):
             return base64.b64decode(self._vector)

         def set_vector(self, value):
             encoded = base64.b64encode(value)
             self._vector = encoded

         vector = property(get_vector, set_vector)
In [16]: np.fromstring(category.vector)[:100]
Out[16]: array([-0.01098032, -0.07306015, 0.00129067, -0.09632728, -0.03656199, 0.01895729, 0.02399054, -0.05853978, 0.07534443, 0.25678171, 0.03339623, 0.06769982, 0.01751063, -0.03782483, 0.01130069, -0.02990636, 0.00893505, -0.08091137, -0.03629005, 0.07032291, -0.00530989, -0.07238248, 0.07250362, -0.02639468, 0.03330246, 0.04583857, -0.02866316, -0.01290668, 0.0158832 , -0.02375447, 0.09646225, -0.03751686, 0.00437724, 0.06163383, 0.02037444, -0.02757899, -0.05151945, 0.04807279, 0.02004282, -0.00287529, 0.03293298, 0.00642406, 0.02201318, 0.0039041 , -0.01417073, -0.01338945, 0.06117504, 0.03147669, -0.00503775, -0.04474968, 0.05347914, -0.05220418, 0.086543 , 0.02560141, 0.04355104, 0.00094792, -0.03385424, -0.12154957, 0.00332025, -0.01003138, 0.02011569, 0.01254879, 0.07975696, 0.09218924, -0.02945207, -0.03867418, 0.05509456, -0.0196757 , -0.02793886, -0.029431 , 0.1481371 , -0.05927895, 0.0147227 , -0.00618005, -0.04828975, -0.04124437, -0.03025203, 0.02376162, 0.09680647, -0.00224569, 0.02096485, -0.05567259, 0.05006393, 0.00714194, -0.0259668 , -0.04632226, -0.09605948, -0.04652946, 0.03884238, 0.00376863, -0.12056644, 0.02819642, 0.02371206, 0.08286085, 0.08104846, -0.03060514, -0.0313298 , -0.00715603, -0.05278924, 0.0031662 ])
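As a standalone illustration of the storage scheme (the Django model itself is not shown in the talk), here is a hedged sketch of the round trip: serialize a numpy float64 vector to base64 text suitable for a MySQL column, then restore it:

    import base64
    import numpy as np

    vec = np.random.rand(300)                     # a 300-dim category vector (float64)

    # encode: raw bytes -> base64 text, safe to store in a TEXT/VARCHAR column
    stored = base64.b64encode(vec.tobytes())

    # decode: base64 text -> raw bytes -> numpy array (shape/dtype must be known: 1-D float64 here)
    restored = np.frombuffer(base64.b64decode(stored), dtype=np.float64)

    assert np.allclose(vec, restored)
    print(restored.nbytes)                        # 2400 bytes for 300 float64 values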
In order to avoid issuing a few thousand SQL queries every time a page is loaded, we use Memcache to store the category space.
As the space is larger than 1 MB, we store each vector under its own key (the category id). They share a common key prefix.
We directly store the numpy vectors through the Gensim API.
A separate key is used for the vocabulary indexes.
In [17]: def set_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set_many({"{0}-{1}".format(VEC, i): space.syn0[i]
                           for i in range(len(space.vocab))})
This also allows adding a category vector to the space without having to rebuild it, simply by stacking its vector in the cache and updating the cached space indexes.
In [18]: def add_last_vector_to_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set("{}-{}".format(VEC, len(space.vocab)-1), space.syn0[-1])
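The talk only shows the write path. A hedged sketch of the corresponding read path, assuming the same sim cache client used above (a Django-cache-style API with get/get_many) and the same VOC/IDX/VEC key conventions, rebuilding the matrix that the similarity computation needs:

    import numpy as np

    def get_space_cache():
        # Hypothetical counterpart to set_space_cache(): reload vocab, index and vectors.
        vocab = sim.get(VOC)
        index2word = sim.get(IDX)
        keys = ["{0}-{1}".format(VEC, i) for i in range(len(vocab))]
        vectors = sim.get_many(keys)                  # dict: key -> numpy vector
        syn0 = np.vstack([vectors[k] for k in keys])  # (num_categories, dims) matrix
        return vocab, index2word, syn0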
Updates
Each process gets its own copy of the vector space.
Whenever a category is added, the space is updated in cache.
Django signals are used to tell other processes to reload the space from cache (see the sketch below).
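A minimal sketch of that signal wiring, under the assumption that categories live in a Django model named Category and that the cache helpers above exist; this is illustrative, not the 3Top codebase:

    # Hypothetical Django signal hook: when a Category is created, push its vector
    # into the cached space so other worker processes can pick it up.
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    @receiver(post_save, sender=Category)
    def update_category_space(sender, instance, created, **kwargs):
        if created:
            category_space.add_vector(instance)           # assumed helper on the space object
            add_last_vector_to_space_cache(category_space)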
Work in progress
We are about to add a few hundred thousand generated categories
The category space will become large in memory: 8 workers * 2.4 kB * 100,000 categories = 1.9 GB
Including entity vectors would improve results for names, places, etc.
Training a specialized corpus using categories scraped all over the web
Training a phrase2vec model on these categories
Resources
Tutorials & Applications
Instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
Word embeddings and RNNs: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Word2vec gensim tutorial: http://rare-technologies.com/word2vec-tutorial/
Clothing style search: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
In digital humanities: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
In digital humanities, application to gender studies: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html
Document classification on Yelp reviews: http://nbviewer.ipython.org/github/taddylab/deepir/blob/master/w2v-inversion.ipynb
Resources
Academic Papers
Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and documents." arXiv preprint arXiv:1405.4053 (2014).
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." (2014).
Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
Thank you!