word2vec EPAM
Text Analytics elements.
Word2Vec and others
Ilya Gerasimov
Software Engineering Team Leader, Saint-Petersburg
April 4, 2015
CONFIDENTIAL
Vector Space Model
Discrete representation
In vector space terms, each word is a one-hot vector: a single 1 and zeroes everywhere else
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (vocab) – 500K (big vocab) – 13M (Google 1T)
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
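The motel/hotel comparison above can be checked in a few lines. This is a minimal sketch, assuming that "AND" on the slide means an element-wise AND of the two vectors; the helper name `one_hot` is hypothetical. The positions of the 1s (index 10 and index 7 in a 15-dimensional vector) are taken from the slide.

```python
DIM = 15  # vector length used on the slide

def one_hot(index, dim=DIM):
    """Return a list of `dim` zeroes with a single 1 at `index`."""
    vec = [0] * dim
    vec[index] = 1
    return vec

motel = one_hot(10)
hotel = one_hot(7)

# Element-wise AND: no position is set in both vectors.
anded = [m & h for m, h in zip(motel, hotel)]

# The dot product is zero as well, so the two words look unrelated.
dot = sum(m * h for m, h in zip(motel, hotel))

print(anded)
print(dot)
```

Any two distinct words give the same result, which is why one-hot vectors carry no notion of similarity.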
TF-IDF metric
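The slide gives only the name of the metric, so the following is a sketch of one common variant, not necessarily the formula shown in the talk: term frequency is the term count divided by the document length, and inverse document frequency is log(N / df). The three toy documents and the function names are assumptions made for illustration.

```python
import math

# Toy documents, already tokenized (illustrative only).
docs = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]

def tf(term, doc):
    """Relative frequency of `term` within one document."""
    return doc.count(term) / float(len(doc))

def idf(term, docs):
    """log(N / df), where df is the number of documents containing `term`."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / float(df)) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "I" occurs in every document, so its weight is zero;
# "NLP" occurs in only one document and gets a positive weight.
print(tf_idf("I", docs[0], docs))
print(tf_idf("NLP", docs[1], docs))
```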
Window-based co-occurrence matrix
Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
           I  like  enjoy  deep  learning  NLP  flying
I          x     2      1     0         0    0       0
like       2     x      0     1         0    1       0
enjoy      1     0      x     0         0    0       1
deep       0     1      0     x         1    0       0
learning   0     0      0     1         x    0       0
NLP        0     1      0     0         0    x       0
flying     0     0      1     0         0    0       x
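The counts in the table can be reproduced with a short script. This is a sketch under two assumptions that the slide does not state explicitly: the window is symmetric with size 1 (which is consistent with the counts shown), and punctuation is dropped. The diagonal, marked `x` in the table, comes out as 0 here.

```python
from collections import defaultdict

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
WINDOW = 1  # assumed window size: one word to the left and one to the right

def cooccurrence(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence(corpus, WINDOW)

# Same word order as the table.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
matrix = [[counts[(row, col)] for col in vocab] for row in vocab]

for word, row in zip(vocab, matrix):
    print("%-8s %s" % (word, row))
```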
Problems
• The matrix grows with the size of the vocabulary
• Very high dimensional: requires a lot of storage
• Models are less robust
Singular Value Decomposition
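The slide carries only the title, so this is a sketch of one plausible use rather than the talk's own example: applying a singular value decomposition to the window-based co-occurrence matrix from the earlier slide to obtain short dense word vectors. It assumes NumPy is available, treats the `x` diagonal as 0, and keeps k = 2 dimensions, which is an arbitrary choice for illustration.

```python
import numpy as np

vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]

# Co-occurrence counts from the earlier slide, with the diagonal set to 0.
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float)

# X = U * diag(s) * Vt, with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of dimensions to keep (assumption)
embeddings = U[:, :k] * s[:k]  # one k-dimensional row per word

for word, vec in zip(vocab, embeddings):
    print("%-8s %s" % (word, np.round(vec, 3)))
```

Each word is now described by 2 numbers instead of 7, which is the point of the reduction for large vocabularies.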
Cosine similarity
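Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the "like" and "enjoy" rows of the co-occurrence matrix from the earlier slide, with the `x` diagonal treated as 0; the choice of those two rows, the function name, and the zero-vector convention are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeroes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows of the co-occurrence matrix (column order: I like enjoy deep learning NLP flying).
like = [2, 0, 0, 1, 0, 1, 0]
enjoy = [1, 0, 0, 0, 0, 0, 1]

sim = cosine_similarity(like, enjoy)
print(round(sim, 3))
```

The two words share the neighbour "I", so the similarity is above zero even though "like" and "enjoy" never occur next to each other.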
Deep learning
N-grams & Skip-grams
London is the capital of Great Britain
N-grams: [London is] [is the capital] [capital of Great Britain]
Skip-grams: [London the capital] [capital Britain] [London Britain]
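The examples above can be generated programmatically. This is a minimal sketch, assuming the usual k-skip-n-gram definition (n words in their original order, with at most k words skipped in total); the slide does not state which definition it uses, and the function names are hypothetical.

```python
from itertools import combinations

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """All k-skip-n-grams: n tokens in order, skipping at most k tokens in total."""
    grams = []
    for i in range(len(tokens) - n + 1):
        # The remaining n-1 tokens are drawn from the next n-1+k positions,
        # which caps the total number of skipped tokens at k.
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((tokens[i],) + rest)
    return grams

tokens = "London is the capital of Great Britain".split()

print(ngrams(tokens, 2)[0])   # first bigram
print(ngrams(tokens, 3)[1])   # second trigram
print(("London", "the", "capital") in skipgrams(tokens, 3, 1))
print(("capital", "Britain") in skipgrams(tokens, 2, 2))
```

With k = 0 the skip-grams reduce to ordinary n-grams; [London Britain] needs k = 5, because five words lie between the two.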
Examples. I
Examples. II
Demo time.
Tools:
Python 2.7
Gensim https://radimrehurek.com/gensim/
Wikipedia corpus