Natural Language Processing with Python
Kodliuk Tetiana
www.vitech.com.ua
What is NLP?
Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and as it is written.
Why NLP?
NUMBERS EVERYWHERE
In the beginning was THE WORD…
The most terrible of all: statistics…
How do statistics lie?
World average:
• 6.1 trillion text messages/year
• 7 billion people
• roughly 2 to 3 messages/day/person
But:
• Teenagers: 50 messages/day
How do statistics lie? (2050)
• 9 billion people acting like teenagers
• 450 billion texts/day
• 164 trillion texts/year (about 6 trillion now)
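The 2050 projection is plain multiplication. A quick check of the slide's figures, assuming a 365-day year (the constant DAYS_PER_YEAR is the only value not taken from the slide):

```python
# Figures from the slide; DAYS_PER_YEAR is an added assumption.
DAYS_PER_YEAR = 365

people_2050 = 9_000_000_000        # 9 billion people
texts_per_person_per_day = 50      # the teenager rate

texts_per_day = people_2050 * texts_per_person_per_day
texts_per_year = texts_per_day * DAYS_PER_YEAR

print(f"{texts_per_day / 1e9:.0f} billion texts/day")
print(f"{texts_per_year / 1e12:.0f} trillion texts/year")
```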
Why Python?
WHAT?
Business problems
• Sentiment analysis
• Spam/non-spam detection
• Similar text search
• Text specialization
If you are a scientist…
“Liquid crystal suspensions of carbon nanotubes assisted by organically modified Laponite nanoplatelets”
Articles Similarity
● How to find similar articles?
● How to find news that is interesting for you?
● How to tell whether two customers are similar?
● How to detect the topic of a text?
Word2Vec - Tomas Mikolov, 2013
LDA2Vec - Christopher Moody, 2015
Doc2Vec - Tomas Mikolov, 2014
Good solution: Doc2Vec
✓ “Oculist and eye-doctor … occur in almost the same environments”, Z. Harris (1954)
✓ “You shall know a word by the company it keeps!”, Firth (1957)
✓ “Tell me who your friends are and I will tell you who you are”, Ukrainian proverb
Word2Vec
Word2Vec: training steps
1. Corpus reading
2. Vocabulary creation
3. Sub-sampling
4. Window moving
5. Feedforward neural network
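The vocabulary and sub-sampling steps can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names, the toy corpus, and the threshold t=1e-3 are assumptions, and the keep probability follows the heuristic from the Word2Vec paper, P(discard) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency.

```python
import math
from collections import Counter

def build_vocabulary(tokens, min_count=1):
    """Count tokens and keep those seen at least min_count times."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}

def keep_probability(word, vocab, total, t=1e-3):
    """Probability of keeping one occurrence of `word` during sub-sampling.

    Derived from P(discard) = 1 - sqrt(t / f(w)); capped at 1.0 for rare words.
    """
    freq = vocab[word] / total
    return min(1.0, math.sqrt(t / freq))

# Toy corpus invented for illustration.
tokens = ("elementary my dear watson elementary my dear fellow "
          "exactly my dear watson").split()
vocab = build_vocabulary(tokens, min_count=1)
total = sum(vocab.values())

# Frequent words get a lower keep probability than rare ones.
for word in sorted(vocab, key=vocab.get, reverse=True):
    print(word, vocab[word], round(keep_probability(word, vocab, total), 3))
```

Frequent words such as "my" are dropped more often than rare words such as "fellow", which is the point of the sub-sampling step.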
Word2Vec: CBoW
[Figure: CBoW diagram with the words it, is, elementary, my, dear, Watson]
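In CBoW (continuous bag of words) the network predicts a word from the words around it. A minimal sketch of the window-moving step that produces those (context, target) training pairs; the function name cbow_pairs and the window size of 2 are assumptions made for illustration:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        yield left + right, target

sentence = ["it", "is", "elementary", "my", "dear", "watson"]
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

Each pair becomes one training example: the context words are the input and the target word is the expected output.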
Word2Vec
it [0.23, 0.45, …… 0.71]
is [0.13, 0.50, …… 0.12]
elementary [0.05, 0.89, …… 0.08]
my [0.65, 0.15, …… 0.41]
dear [0.98, 0.21, …… 0.11]
watson [0.42, 0.12, …… 0.81]
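Once every word has a vector, two words can be compared with cosine similarity. A minimal sketch; the three-dimensional vectors below are hypothetical values invented for illustration and are not the elided vectors from the slide:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "holmes": [0.9, 0.1, 0.3],
    "psmith": [0.8, 0.2, 0.3],
    "said":   [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["holmes"], toy_vectors["psmith"]))
print(cosine_similarity(toy_vectors["holmes"], toy_vectors["said"]))
```

With these toy values the first pair scores higher than the second, which is the behaviour expected for words that appear in similar contexts.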
Word2Vec
Sherlock Holmes cried: “Exactly, my dear Watson!”
Holmes said: “Elementary, my dear fellow! Ho! Elementary”
Then Psmith murmured: “Elementary, my dear Watson, elementary,”
[Figure labels: Holmes, Psmith, Watson, fellow, Elementary, Exactly, cried, said]
Doc2Vec
Vector for document:
Titanic.txt [0.23, 0.45, …… 0.71]
Room.txt [0.13, 0.50, …… 0.12]
Sangredus.txt [0.05, 0.89, …… 0.08]
Umbriel.txt [0.65, 0.15, …… 0.41]
Dumped.txt [0.98, 0.21, …… 0.11]
Nessa.txt [0.42, 0.12, …… 0.81]
Vector for word:
titanic [0.03, 0.89, …… 0.71]
apartment [0.83, 0.50, …… 0.12]
room [0.55, 0.89, …… 0.08]
parrot [0.62, 0.15, …… 0.41]
nessa [0.08, 0.21, …… 0.11]
word [0.42, 0.12, …… 0.81]
Doc2Vec
[Figure: diagram with the labels LDA, Doc2Vec and Word2Vec]
LDA
LDA2Vec
Document = 0.15*programming + 0.25*football + 0.60*beer
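The weighted sum on this slide treats a document as a mixture of topic vectors. A minimal sketch of that mixture; the weights come from the slide, while the three-dimensional topic vectors and the helper mix_topics are hypothetical and chosen only so the result is easy to read:

```python
# Topic weights from the slide; they sum to 1.
weights = {"programming": 0.15, "football": 0.25, "beer": 0.60}

# Hypothetical topic vectors, invented for illustration only.
topic_vectors = {
    "programming": [1.0, 0.0, 0.0],
    "football":    [0.0, 1.0, 0.0],
    "beer":        [0.0, 0.0, 1.0],
}

def mix_topics(weights, topic_vectors):
    """Return the weighted sum of the topic vectors."""
    dim = len(next(iter(topic_vectors.values())))
    doc = [0.0] * dim
    for topic, w in weights.items():
        for i, x in enumerate(topic_vectors[topic]):
            doc[i] += w * x
    return doc

document_vector = mix_topics(weights, topic_vectors)
print(document_vector)
```

In a trained model the topic vectors live in the same space as the word vectors, so the nearest words to each topic vector describe the topic.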
Doc2Vec
TIME FOR PYTHON
Why Python?
• NLTK
• Gensim
• TextBlob
• Urllib
• Pattern
• Orange
• Sklearn
Data Science Flow
1. Target formulation
2. Wikipedia parsing
3. Text cleaning
4. Model building
5. Results analysis
Target formulation
• Article similarity for Doc2Vec
• Topics for LDA2Vec
Wikipedia parsing
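The slide does not show how the pages were parsed, so the following is only one possible sketch using the standard library. It extracts paragraph text from an HTML string; the class name ParagraphText and the sample snippet are hypothetical, and downloading the page (for example with urllib from the library list) is left out.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # greater than 0 while inside a <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1] += data

# Hypothetical snippet standing in for a downloaded article page.
html = ('<div><p>Ukraine is a country in '
        '<a href="/wiki/Eastern_Europe">Eastern Europe</a>.</p>'
        '<span>edit</span></div>')

parser = ParagraphText()
parser.feed(html)
print(parser.paragraphs)
```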
Text cleaning
• Tokenization
• Digits removing
• Stopwords removing
• Punctuation cleaning
• Coding
• Stemming
['ukraine', 'ukrainian', 'ukraina', 'country', 'eastern', 'europe', 'bordered', 'russia', 'east', 'northeast', 'belarus', 'northwest', 'poland', 'slovakia', 'west', 'hungary', 'romania', 'moldova', 'southwest', 'black', 'azov', 'south', 'southeast', 'respectively', 'ukraine', 'currently', 'territorial', 'dispute', 'russia', 'crimean', 'peninsula', 'russia', 'annexed', 'ukraine', 'international', 'community', 'recognise', 'ukrainian', 'including', 'crimea', 'ukraine', 'area', 'making', 'largest', 'country', 'entirely', 'within', 'europe', 'largest', 'country', 'world', 'population', 'million', 'making', 'populous', 'country', 'world']
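A token list like the one above can be produced with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the stopword set is a tiny hypothetical one (a real run would use a fuller list, for example from NLTK), the sample sentence is invented, and stemming is left out because the tokens on the slide appear unstemmed.

```python
import re

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"is", "a", "in", "it", "by", "to", "the", "and", "of"}

def clean_text(text):
    """Lowercase, drop digits and punctuation, tokenize, remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # digits removing
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation cleaning
    tokens = text.split()                   # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

sample = "Ukraine is a country in Eastern Europe, bordered by Russia to the east [1][2]."
print(clean_text(sample))
```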
Doc2Vec: LabeledSentence
“Doc_12” “Robot” “Food_cat” LDA vec
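Each document is passed to Doc2Vec as a list of tokens plus a list of tags such as the ones above. In gensim, LabeledSentence was later superseded by TaggedDocument, which carries the two fields words and tags. The sketch below uses a plain namedtuple as a stand-in with the same field names so that it runs without gensim; the token list is a hypothetical example.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument;
# swap in the gensim class for real training.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

doc = TaggedDocument(
    words=["ukraine", "country", "eastern", "europe"],  # cleaned tokens (hypothetical)
    tags=["Doc_12", "Robot", "Food_cat"],               # tags from the slide
)

print(doc.words)
print(doc.tags)
```

Every tag gets its own vector during training, so a document can be looked up by its unique id ("Doc_12") or by a shared label.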
Doc2Vec
from gensim.models.doc2vec import Doc2Vec
# In gensim 4.x the size argument is named vector_size.
model = Doc2Vec(size=300, window=10, min_count=10, workers=4, alpha=0.025, min_alpha=0.025)
Doc2Vec as Word2Vec
[Figure labels: ARTICLE, WORD, MALWARE]
Robot
Hobbit
Programmer
Math
LDA2Vec: LabeledSentence
“Doc_12” “Robot” “Food_cat” LDA vec
LDA
import gensim
lda = gensim.models.ldamodel.LdaModel(modelled_corpus, num_topics=20, update_every=100, passes=20, id2word=dictionary, alpha='auto', eval_every=5)
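The LdaModel call takes a bag-of-words corpus (modelled_corpus) and an id-to-word mapping (dictionary), neither of which is shown on the slide. With gensim these are usually built with corpora.Dictionary and its doc2bow method. The plain-Python sketch below only illustrates the shape of that data: the helper names and the sample texts are hypothetical, and the ids it assigns will not match gensim's.

```python
from collections import Counter

def build_dictionary(texts):
    """Map each distinct token to an integer id (stand-in for a gensim Dictionary)."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical cleaned documents.
texts = [
    ["ukraine", "country", "europe", "ukraine"],
    ["robot", "programmer", "math"],
]
token2id = build_dictionary(texts)
corpus = [to_bow(tokens, token2id) for tokens in texts]
print(corpus)
```

Each document becomes a list of (token id, count) pairs, which is the input format the topic model is trained on.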
Useful links
1. http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf
2. https://radimrehurek.com/gensim/models/doc2vec.html
3. https://github.com/cemoody/lda2vec
4. https://mebius.io/analysis/intro-to-LDA