Word embeddings as a service - PyData NYC 2015
Word embeddings as a service
François Scharffe, @lechatpito (http://www.twitter.com/lechatpito)
3Top (http://www.3top.com)
PyData NYC 2015
Outline of the talk
What is 3Top?
What are word embeddings?
How to implement a simple recommendation system for 3Top categories?
Rank Anything, Rank Everything
3Top is a ranking and recommendation platform
Rankings convey more information than star ratings
Who cares about 3 stars or less? I just want the best stuff
I'd rather trust my friends than read through reviews
If I have more than 3 items to rank, I can probably use a more precise category
Not yet launched, but the site is up
Let's take a look at http://www.3top.com
Places
http://www.3top.com/category/1138/gyms-for-students-near-lower-east-side
Movies
http://www.3top.com/category/142/movies-about-wall-street
Anything really
http://www.3top.com/category/765/foods-named-after-people
Data & knowledge engineering at 3Top
Building a solid data engineering architecture before launching the site.
Natural language processing pipeline
Parsing categories
Currently using a parser we developed
About to switch to spaCy (http://spacy.io) from Matthew Honnibal. It's great, check it out!
Detecting named entities, locations
A large knowledge graph backed by an ontology
An itemization pipeline: matching free-text items to entities in the knowledge graph
Category recommendation
How are we going to build a simple recommendation system without having any significant number of users, categories, or rankings?
Note the impressive figures:
Number of Users: 316
Number of Rankings: 2123
Number of Categories: 1316
Feel free to add a few rankings:
Wow! ;)
http://www.3top.com
Word embeddings?
Who hasn't heard about word2vec?
Word embeddings represent words in a high-dimensional space in such a way that words appearing in similar contexts are close to each other in that space.
The dimensionality of the space is not that high, typically a few hundred dimensions.
Word embeddings are a language modeling method, more precisely a distributed vector representation of words.
Compared to bag-of-words:
Dimensionality is low and constant with respect to the vocabulary size
Depending on the training algorithm, partially trained models give partially good results
Compared to topic modeling:
Better granularity, the base element is a word
Phrase vectors can also be learned
What are word embedding models good at?
Modeling similarity between words:
sim(tomato, beefsteak) < sim(apple, tomato) < sim(pear, apple)
Algebraic operations on word vectors:
v(Paris) - v(France) ~= v(Berlin) - v(Germany)
Examples
The examples here use a GloVe model (100 dimensions, 400k vocabulary, trained on Wikipedia and Gigaword (news articles)).
In [3]: from gensim.models import Word2Vec
        model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")
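One practical note (an assumption about preprocessing, not shown in the talk): raw GloVe text files lack the "<vocab_size> <dimensions>" header line that gensim's word2vec text format expects, so a small conversion step is usually needed first. A minimal sketch:

    # Assumed preprocessing: prepend the header line gensim's load_word2vec_format() expects.
    def add_word2vec_header(glove_path, out_path):
        with open(glove_path) as f:
            lines = f.readlines()
        n_words = len(lines)
        n_dims = len(lines[0].split()) - 1  # first field on each line is the word itself
        with open(out_path, "w") as f:
            f.write("%d %d\n" % (n_words, n_dims))
            f.writelines(lines)

    # add_word2vec_header("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")

Newer gensim versions ship a gensim.scripts.glove2word2vec helper that performs the same conversion.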
In [4]: model.most_similar("python", topn=10)
Out[4]: [(u'monty', 0.6886237859725952), (u'php', 0.586538553237915), (u'perl', 0.5784406661987305), (u'cleese', 0.5446674823760986), (u'flipper', 0.5112984776496887), (u'ruby', 0.5066927671432495), (u'spamalot', 0.505638837814331), (u'javascript', 0.5030568838119507), (u'reticulated', 0.4983375668525696), (u'monkey', 0.49764129519462585)]
In [5]: model.most_similar_cosmul(positive=["python", "programming"], topn=5)
Out[5]: [(u'perl', 0.5658619999885559), (u'scripting', 0.559501588344574), (u'scripts', 0.5469149351119995), (u'php', 0.5461974740028381), (u'language', 0.5350533127784729)]
In [6]: model.most_similar_cosmul(positive=["python", "venomous"], topn=5)
Out[6]: [(u'scorpion', 0.5413044095039368), (u'snakes', 0.5263831615447998), (u'snake', 0.5222328901290894), (u'spider', 0.5214570164680481), (u'marsupial', 0.517005205154419)]
The classical example:
v(king) - v(man) + v(woman) -> v(queen)
In [7]: model.most_similar_cosmul(positive=["king", "woman"], negative=["man"])
Out[7]: [(u'queen', 0.8964556455612183), (u'monarch', 0.8495977520942688), (u'throne', 0.8447030782699585), (u'princess', 0.8371668457984924), (u'elizabeth', 0.835679292678833), (u'daughter', 0.8348594903945923), (u'prince', 0.8230059742927551), (u'mother', 0.8154449462890625), (u'margaret', 0.8147734999656677), (u'father', 0.8100854158401489)]
Training a model
Very easy once you have a clean corpus
Great tools in Python
Tutorial on training a model using Gensim: http://rare-technologies.com/word2vec-tutorial/
Radim Řehůřek gave a talk last year at PyData Berlin about optimizations in Cython: https://www.youtube.com/watch?v=vU4TlwZzTfU
For GloVe: https://github.com/maciejkula/glove-python
Gensim word2vec implementation specifics:
Training time ~ 8 hours on 8 cores/8 threads to learn 600 dimensions on a 1.9B-word corpus
Memory requirements depend on the vocabulary size and on the number of dimensions:
3 matrices * 4 bytes (float) * |dimensions| * |vocabulary|
The GloVe implementation in Python (https://github.com/maciejkula/glove-python/) takes half the time but has a quadratic memory size. Check its pull requests for memory optimizations.
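As a quick illustration of that memory formula (the 3-million-word vocabulary below is an assumed example figure, not one from the talk):

    # Back-of-the-envelope memory estimate for a gensim word2vec model:
    # 3 matrices * 4 bytes (float32) * |dimensions| * |vocabulary|
    def w2v_memory_gb(dimensions, vocabulary_size):
        return 3 * 4.0 * dimensions * vocabulary_size / 1e9

    print(w2v_memory_gb(600, 3 * 10**6))   # ~21.6 GB for 600 dimensions and a 3M-word vocabulary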
A good thing to know: a bigger training set does improve the quality of the model, even for specialized tasks.
As a consequence, you probably want to use a huge corpus. Good models are available.
Building your own model can be useful when you want to find out about the properties of your corpus, or when you want to compare different corpora, for example the evolution of language in a newspaper across different periods of time.
Finding a model
From https://github.com/3Top/word2vec-api/ (model file - number of dimensions):

Google News (GoogleNews-vectors-negative300.bin.gz) - 300
Freebase IDs (https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) - 1000
Freebase names (https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing) - 1000
Wikipedia+Gigaword 5 (http://nlp.stanford.edu/data/glove.6B.zip) - 50/100/200/300
Common Crawl 42B (http://nlp.stanford.edu/data/glove.42B.300d.zip) - 300
Common Crawl 840B (http://nlp.stanford.edu/data/glove.840B.300d.zip) - 300
Twitter (2B Tweets) (http://www-nlp.stanford.edu/data/glove.twitter.27B.zip) - 25/50/100/200
Wikipedia dependency (http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) - 300
DBPedia vectors (https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent) - 1000
Building a recommendation engine for 3Top categories
By combining word vectors, we build category vectors.
In [8]: def build_category_vector(category):
            pass  # Get the postags
            vector = []
            for tag in postags:
                if tag.tagged in ['NN', 'NNS', 'JJ', 'NNP', 'NNPS', 'NNDBN', 'VBG', 'CD']:  # Only keep meaningful words
                    try:
                        v = word2vec(tag.tagValue)  # Get the word vector
                        if v.any():
                            vector.append(v)
                    except:
                        logger.debug("Word not found in corpus: %s" % tag.tagValue)
                        tagset.add(tag.tagValue)
            if vector:
                return matutils.unitvec(np.array(vector).mean(axis=0))  # Average the vector
            else:
                return np.empty(300)
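The function above relies on 3Top-internal objects (the POS tagger, word2vec, logger, tagset). A minimal, self-contained sketch of the same idea, assuming only a loaded gensim word-vector model and naive whitespace tokenization (both assumptions, not the production code):

    import numpy as np
    from gensim import matutils

    def build_category_vector_sketch(model, category_text, dims=300):
        # Average the word vectors of a category's words into one unit vector.
        # `model` is assumed to be a gensim word-vector model supporting `in` and indexing.
        vectors = []
        for word in category_text.lower().split():
            if word in model:               # skip out-of-vocabulary words
                vectors.append(model[word])
        if not vectors:
            return np.zeros(dims)
        return matutils.unitvec(np.array(vectors).mean(axis=0))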
We store those vectors in a category space, and at page load time compute the most similar categories for a given category.
Now let us look at the similarity method.
sim(c1, c2) = v(c1) . v(c2)
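Since the category vectors are unit-normalized, this similarity is just a dot product (equivalently, cosine similarity). A small sketch of ranking a category space against a query vector, using the hypothetical build_category_vector_sketch helper above:

    import numpy as np

    def most_similar_categories_sketch(query_vec, category_vecs, names, n=5):
        # category_vecs: (num_categories, dims) array of unit vectors, names: parallel list of labels
        sims = category_vecs.dot(query_vec)     # dot product == cosine similarity for unit vectors
        best = np.argsort(-sims)[:n]
        return [(names[i], float(sims[i])) for i in best]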
In [21]: cs = CategorySimilarity()
         # print(Category.objects.all().count())
         category = Category.objects.get(category=u"Blue-collar beers that come in a can")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]
DEBUG Category space size (as found in the cache): 1125
Belgian Trappist Beers
Belgian Beer Cafe in NYC
Dark and Stormy Cocktail in NYC
Brands of Ginger Beer
Pink Drinks
In [10]: category = Category.objects.get(category=u"Italian Restaurants in NYC.")
         _ = [print(c) for c in cs.most_similar_categories(category, n=5)]
Italian Restaurants in NY
Restaurants in Nyc
NYC Mexican Restaurants
Romanian Restaurants in NYC
Thai Restaurants in NYC
In [11]: category = Category.objects.get(category=u"Coen Brothers Movies")
         _ = [print(c) for c in cs.most_similar_categories(category, n=10)]
Quentin Tarantino Movies
Martin Scorsese Films.
Movies Starring Creepy Children
Tim Burton Movies
Movies Starring Sean Penn
Pixar Movies
Godfather Movies
Berlin Indie Movie Theaters
Kubrick Movies
Harry Potter Movies
Our recommendation system uses the Common Crawl 42B-word, 300-dimension model trained with GloVe.
It takes around 6GB in memory ... and this is a problem:
We run a Django server and 8 celery workers on an EC2 T2 Micro... That would be a lot ofmemory for that poor instance.
A word embedding service
We split the word embedding model out into a separate service
A simple Flask server with a few primitives:
curl http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java
curl http://127.0.0.1:5000/word2vec/n_similarity?ws1=Python&ws1=programming&ws2=Java
curl http://127.0.0.1:5000/word2vec/model?word=Python
curl http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=queen&negative=man
Easy to set up:
python word2vec-api --model path/to/the/model [--host host --port 1234]
Get it at https://github.com/3Top/word2vec-api
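For illustration, a minimal sketch of what such a service can look like. This is not the word2vec-api code itself, just an assumed stripped-down version exposing a similarity and a most_similar endpoint over a model loaded at startup:

    # Minimal sketch of a word-embedding service (assumed simplification of word2vec-api).
    from flask import Flask, request, jsonify
    from gensim.models import Word2Vec

    app = Flask(__name__)
    # Same loading call as in the talk; newer gensim versions use KeyedVectors.load_word2vec_format.
    model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")  # example path

    @app.route("/word2vec/similarity")
    def similarity():
        # e.g. /word2vec/similarity?w1=Python&w2=Java
        return jsonify(similarity=float(model.similarity(request.args["w1"], request.args["w2"])))

    @app.route("/word2vec/most_similar")
    def most_similar():
        # e.g. /word2vec/most_similar?positive=king&positive=woman&negative=man
        results = model.most_similar(positive=request.args.getlist("positive"),
                                     negative=request.args.getlist("negative"))
        return jsonify(results=results)

    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=5000)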
Caching the vector space
As the number of categories increases, we do not want to rebuild category vectors and hit the database every time a recommendation is needed (every page access). The category vectors' size is actually significant:
In [12]: print(np.fromstring(category.vector).nbytes)
2400
In [13]: print(u"... and for {} categories the space becomes large (~{}MB)".format(len(cs.category_space.syn0), cs.category_space.syn0.nbytes/1000000))
... and for 1125 categories the space becomes large (~2MB)
We store vectors with the category object in MySQL, using a base64 encoding of the numpy object. Let's look at it:
In [14]: print(category._vector[:1000] + "...")
oNl7j9l8hr/2FoHhEbSyv+GALkJ6JVU/rxm5pueouL80aF72QLiivzr0z1WKaZM/i3M5FvqQmD+hDmIs7fitv+lL6cLFSbM/lSwwkBxv0D/sA+FgTxmhP5+lJZPGVLE/q8pBj07ukT9OceKjxl2jv4s4cA7RJIc/JxVUF8afnr/RQcXciUyCP+M4N3mbtrS/Ngwo85uUor/+4vqargCyP7YnHbTSv3W/MHjrh6iHsr/1vkzSmI+yP+bsS7E0B5u/JJ5iJAUNoT//xo0IJ3inP5/BwCfgWZ2/Q2r8q9Fuir/KdAOAr0OQPwzGTnUXU5i/9uQD77+xuD/1QKbEaDWjvyqfSePd7XE/3RLqJXiOrz8ZyEDICd2UP2beFLiqPZy/rIb+8sFgqr+ILyc3/5yoP5pL25IahpQ/4WpgCeuNZ7/ley+Tl9ygP+knz2odUHo/yBSdc5+Klj+GLgrafftvP2yiB76KBY2/z0RqB+1ri7+THdXBVVKvPzwZ2X+2HaA/oOThsHeidL/O7w8+bummv8Z8XCqeYas/WzQpioG6qr+JaauGrie2P7+8NmNBN5o/0Ji6XFJMpj/xAtoHvg9PPyr3OOBXVaG/M2aCbN8dv79pANKgDzNrPy4XXBNVi4S/WBuYYjWZlD8T/W3jLbOJPy3xHNTzarQ/MoOWx7aZtz/RDMwbryievwA5kQgazaO/3Ep0jVo1rD+ns3oJ3iWUv4TlEPcAnJy/dHNcwygjnr/cMGYNKPbCP5E06afPWa6/mUHAC+8mjj+NwgyjQFB5v6ffLvduuai/kBntVvsdpb8Yg3KzY/qev9r5son3VJg/h06aD0/IuD8NMHm7jGViv7o8zQzPd5U/esP4Ax6BrL8TOZuX+qGpP1WHNPzdQH0/7HXRMAqXmr9G8pkwjbenv3RvQppal7i/E5jWmLXSp792VpPxJeOjPyEKhEhl324/1E00QnHdvr9Mg0Fohd+cP6UAj0X5R5g/2umwTF42...
In [15]: # a property method takes care of the decoding
         def get_vector(self):
             return base64.b64decode(self._vector)

         def set_vector(self, value):
             encoded = base64.b64encode(value)
             self._vector = encoded

         vector = property(get_vector, set_vector)
In [16]: np.fromstring(category.vector)[:100]
Out[16]: array([-0.01098032, -0.07306015, 0.00129067, -0.09632728, -0.03656199, 0.01895729, 0.02399054, -0.05853978, 0.07534443, 0.25678171, 0.03339623, 0.06769982, 0.01751063, -0.03782483, 0.01130069, -0.02990636, 0.00893505, -0.08091137, -0.03629005, 0.07032291, -0.00530989, -0.07238248, 0.07250362, -0.02639468, 0.03330246, 0.04583857, -0.02866316, -0.01290668, 0.0158832 , -0.02375447, 0.09646225, -0.03751686, 0.00437724, 0.06163383, 0.02037444, -0.02757899, -0.05151945, 0.04807279, 0.02004282, -0.00287529, 0.03293298, 0.00642406, 0.02201318, 0.0039041 , -0.01417073, -0.01338945, 0.06117504, 0.03147669, -0.00503775, -0.04474968, 0.05347914, -0.05220418, 0.086543 , 0.02560141, 0.04355104, 0.00094792, -0.03385424, -0.12154957, 0.00332025, -0.01003138, 0.02011569, 0.01254879, 0.07975696, 0.09218924, -0.02945207, -0.03867418, 0.05509456, -0.0196757 , -0.02793886, -0.029431 , 0.1481371 , -0.05927895, 0.0147227 , -0.00618005, -0.04828975, -0.04124437, -0.03025203, 0.02376162, 0.09680647, -0.00224569, 0.02096485, -0.05567259, 0.05006393, 0.00714194, -0.0259668 , -0.04632226, -0.09605948, -0.04652946, 0.03884238, 0.00376863, -0.12056644, 0.02819642, 0.02371206, 0.08286085, 0.08104846, -0.03060514, -0.0313298 , -0.00715603, -0.05278924, 0.0031662 ])
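As a standalone illustration of the storage scheme (the Django model itself is not shown in the talk), here is a hedged sketch of the round trip: serialize a numpy float64 vector to base64 text suitable for a MySQL column, then restore it:

    import base64
    import numpy as np

    vec = np.random.rand(300)                     # a 300-dim category vector (float64)

    # encode: raw bytes -> base64 text, safe to store in a TEXT/VARCHAR column
    stored = base64.b64encode(vec.tobytes())

    # decode: base64 text -> raw bytes -> numpy array (shape/dtype must be known: 1-D float64 here)
    restored = np.frombuffer(base64.b64decode(stored), dtype=np.float64)

    assert np.allclose(vec, restored)
    print(restored.nbytes)                        # 2400 bytes for 300 float64 values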
In order to avoid issuing a few thousand SQL queries every time a page is loaded, we use Memcache to store the category space.
As the space is larger than 1 MB, we store each vector under its own key (the category id). They share a common key prefix.
We directly store the numpy vectors through the Gensim API.
A separate key is used for the vocabulary indexes.
In [17]: def set_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set_many({"{0}-{1}".format(VEC, i): space.syn0[i]
                           for i in range(len(space.vocab))})
This also allows adding a category vector to the space without having to rebuild it, simply by stacking its vector in the cache and updating the cached space indexes.
In [18]: def add_last_vector_to_space_cache(space):
             sim.set(VOC, space.vocab)
             sim.set(IDX, space.index2word)
             sim.set("{}-{}".format(VEC, len(space.vocab)-1), space.syn0[-1])
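The talk only shows the write path. A hedged sketch of the corresponding read path, assuming the same sim cache client used above (a Django-cache-style API with get/get_many) and the same VOC/IDX/VEC key conventions, rebuilding the matrix that the similarity computation needs:

    import numpy as np

    def get_space_cache():
        # Hypothetical counterpart to set_space_cache(): reload vocab, index and vectors.
        vocab = sim.get(VOC)
        index2word = sim.get(IDX)
        keys = ["{0}-{1}".format(VEC, i) for i in range(len(vocab))]
        vectors = sim.get_many(keys)                  # dict: key -> numpy vector
        syn0 = np.vstack([vectors[k] for k in keys])  # (num_categories, dims) matrix
        return vocab, index2word, syn0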
Updates
Each process gets its own copy of the vector space.
Whenever a category is added, the space is updated in cache.
Django signals are used to tell other processes to reload the space from cache (see the sketch below).
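A minimal sketch of that signal wiring, under the assumption that categories live in a Django model named Category and that the cache helpers above exist; this is illustrative, not the 3Top codebase:

    # Hypothetical Django signal hook: when a Category is created, push its vector
    # into the cached space so other worker processes can pick it up.
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    @receiver(post_save, sender=Category)
    def update_category_space(sender, instance, created, **kwargs):
        if created:
            category_space.add_vector(instance)           # assumed helper on the space object
            add_last_vector_to_space_cache(category_space)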
Work in progress
We are about to add a few hundred thousand generated categories
The category space will become large in memory: 8 workers * 2.4 kB * 100,000 categories = 1.9 GB
Including entity vectors would improve results for names, places, etc.
Training a specialized corpus using categories scraped all over the web
Training a phrase2vec model on these categories
Resources
Tutorials & Applications
Instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
Word embeddings and RNNs: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Word2vec gensim tutorial: http://rare-technologies.com/word2vec-tutorial/
Clothing style search: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
In digital humanities: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
In digital humanities, application to gender studies: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html
Document classification on Yelp reviews: http://nbviewer.ipython.org/github/taddylab/deepir/blob/master/w2v-inversion.ipynb
Resources
Academic Papers
Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and documents." arXiv preprint arXiv:1405.4053 (2014).
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." (2014).
Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
Thank you!