Natural Language Processing with Python Kodliuk Tetiana

Post on 14-Apr-2017

Page 1: Natural Language Processing with Python

Natural Language Processing with Python

Kodliuk Tetiana

Page 2: Natural Language Processing with Python
Page 3: Natural Language Processing with Python

www.vitech.com.ua

What is NLP?

Natural language processing (NLP) is the ability of a computer program to understand human language, both as it is spoken and as it is written.

Page 4: Natural Language Processing with Python

Why NLP?

NUMBERS EVERYWHERE

In the beginning was THE WORD…

Page 5: Natural Language Processing with Python

The scariest part: statistics…

Page 6: Natural Language Processing with Python

How do statistics lie?

World average:
• 6.1 trillion text messages / year
• 7 billion people
• about 2 to 3 messages/day/person

But:
• Teenagers: 50 messages/day

Page 7: Natural Language Processing with Python

How do statistics lie? 2050

• 9 billion people acting like teenagers
• 450 billion texts/day
• 164 trillion texts/year (6 trillion now)
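The arithmetic behind these figures can be checked in a few lines. This is a minimal sketch that uses only the numbers quoted on the slides; the 365-day year is an assumption.

```python
# Figures quoted on the slides; the 365-day year is an assumption.
MESSAGES_PER_YEAR_NOW = 6.1e12   # 6.1 trillion text messages / year
PEOPLE_NOW = 7e9                 # 7 billion people
PEOPLE_2050 = 9e9                # 9 billion people
TEEN_MESSAGES_PER_DAY = 50
DAYS_PER_YEAR = 365

# World average today, per person and per day.
per_person_per_day_now = MESSAGES_PER_YEAR_NOW / PEOPLE_NOW / DAYS_PER_YEAR

# 2050 scenario: everyone texts like a teenager.
texts_per_day_2050 = PEOPLE_2050 * TEEN_MESSAGES_PER_DAY
texts_per_year_2050 = texts_per_day_2050 * DAYS_PER_YEAR

print(f"today: {per_person_per_day_now:.1f} messages/day/person")
print(f"2050:  {texts_per_day_2050 / 1e9:.0f} billion texts/day")
print(f"2050:  {texts_per_year_2050 / 1e12:.0f} trillion texts/year")
```

The exact world average comes out a little under the rounded figure of 3 messages per day, while the 2050 numbers match the slide.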

Page 8: Natural Language Processing with Python

Why Python?

WHAT?

Page 9: Natural Language Processing with Python

Business problems

• Sentiment analysis

• Spam/non-spam detection

• Similar text searching

• Text specialization

Page 10: Natural Language Processing with Python

If you are a scientist…

“Liquid crystal suspensions of carbon nanotubes assisted by organically modified Laponite nanoplatelets”

Page 11: Natural Language Processing with Python

Articles Similarity

● How to find similar articles?
● How to find news that is interesting for you?
● How to tell whether two customers are similar?
● How to detect the theme of a text?

Page 12: Natural Language Processing with Python

Word2Vec - Tomas Mikolov, 2013

LDA2Vec - Christopher Moody, 2015

Doc2Vec - Quoc Le and Tomas Mikolov, 2014

Page 13: Natural Language Processing with Python

Good solution: Doc2Vec

✓ “Oculist and eye-doctor … occur in almost the same environments”, Z. Harris (1954)

✓ “You shall know a word by the company it keeps!”, Firth (1957)

✓ “Tell me who your friends are and I will tell you who you are”, Ukrainian proverb
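The idea in these quotes, that words are characterised by the contexts they occur in, can be illustrated by counting neighbouring words. This is a minimal sketch rather than Word2Vec or Doc2Vec themselves: the toy corpus is invented, and the helpers `context_counts` and `cosine` are hypothetical names introduced for the example.

```python
from collections import Counter
from math import sqrt

# Invented toy corpus, for illustration only.
sentences = [
    "the oculist examined my eyes",
    "the eye-doctor examined my eyes",
    "the oculist prescribed new glasses",
    "the eye-doctor prescribed new glasses",
    "the parrot repeated my words",
]

def context_counts(word, window=2):
    """Count the words seen within `window` positions of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

oculist, eye_doctor, parrot = (
    context_counts(w) for w in ("oculist", "eye-doctor", "parrot")
)
print(cosine(oculist, eye_doctor))  # identical contexts in this toy corpus
print(cosine(oculist, parrot))      # fewer shared contexts, lower score
```

Embedding models replace these raw counts with dense learned vectors, but the comparison step stays the same.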

Page 14: Natural Language Processing with Python

Word2Vec

Page 15: Natural Language Processing with Python

Word2Vec

Page 16: Natural Language Processing with Python

Word2Vec

Corpus reading

Vocabulary creation

Sub-sampling

Window moving

Feedforward neural network
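The first four steps can be sketched in plain Python. This is an illustration under stated assumptions, not the reference implementation: the two-sentence corpus is invented, the sub-sampling rule is the keep probability sqrt(t / f(w)) from the 2013 paper with an unrealistically large threshold so that the toy corpus is not emptied, and the neural network step is left out.

```python
import random
from collections import Counter
from math import sqrt

# Invented two-sentence corpus; a real run would read many documents.
corpus = [
    "it is elementary my dear watson",
    "exactly my dear watson",
]

# 1. Corpus reading: split each line into lowercase tokens.
sentences = [line.lower().split() for line in corpus]

# 2. Vocabulary creation: map every distinct word to an integer id.
counts = Counter(word for sentence in sentences for word in sentence)
vocab = {word: idx for idx, word in enumerate(sorted(counts))}
total = sum(counts.values())

# 3. Sub-sampling: frequent words are randomly dropped.
def keep_probability(word, threshold):
    """Probability of keeping `word`, capped at 1 for rare words."""
    frequency = counts[word] / total
    return min(1.0, sqrt(threshold / frequency))

def subsample(sentence, threshold, rng):
    return [w for w in sentence if rng.random() < keep_probability(w, threshold)]

# 4. Window moving: pair each word with its neighbours on both sides.
def windows(sentence, size=2):
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - size):i] + sentence[i + 1:i + 1 + size]
        yield context, target

rng = random.Random(0)  # fixed seed so repeated runs drop the same words
for sentence in sentences:
    kept = subsample(sentence, threshold=0.05, rng=rng)
    for context, target in windows(kept):
        print(context, "->", target)
```

Each (context, target) pair produced by `windows` is one training example for the network in the last step.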

Page 17: Natural Language Processing with Python

Word2Vec

[Figure: CBoW (continuous bag of words) diagram over the words “It”, “is”, “elementary”, “my”, “dear”, “Watson”]
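In CBoW the network averages the vectors of the context words and uses that average to score every word in the vocabulary. The forward pass can be sketched as follows. The weights are random and untrained, the dimension of 4 is arbitrary, and treating “elementary” as the word to predict is an assumption, since the transcript does not say which word the diagram used as the target.

```python
import random
from math import exp

vocab = ["it", "is", "elementary", "my", "dear", "watson"]
index = {word: i for i, word in enumerate(vocab)}
DIM = 4                      # arbitrary embedding size for the sketch
rng = random.Random(42)

# Input and output embeddings, randomly initialised (untrained).
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]

def cbow_probabilities(context):
    """Average the context embeddings, score every word, apply softmax."""
    rows = [w_in[index[word]] for word in context]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in w_out]
    peak = max(scores)                       # subtract max for numerical stability
    exps = [exp(score - peak) for score in scores]
    total = sum(exps)
    return {word: e / total for word, e in zip(vocab, exps)}

# Assumed target: "elementary", predicted from the other five words.
probs = cbow_probabilities(["it", "is", "my", "dear", "watson"])
print(probs["elementary"])
```

Training would adjust `w_in` and `w_out` so that the probability of the true centre word rises; the rows of `w_in` are then used as the word vectors.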

Page 18: Natural Language Processing with Python

Word2Vec

it [0.23, 0.45, …… 0.71]

is [0.13, 0.50, …… 0.12]

elementary [0.05, 0.89, …… 0.08]

my [0.65, 0.15, …… 0.41]

dear [0.98, 0.21, …… 0.11]

watson [0.42, 0.12, …… 0.81]

Page 19: Natural Language Processing with Python

Word2Vec

Sherlock Holmes cried: “Exactly, my dear Watson!”

Holmes said: “Elementary, my dear fellow! Ho! Elementary”

Then Psmith murmured: “Elementary, my dear Watson, elementary,”

[Figure: words that share contexts group together: Holmes / Psmith, Watson / fellow, Elementary / Exactly, cried / said]

Page 20: Natural Language Processing with Python

Doc2Vec

Page 21: Natural Language Processing with Python

Doc2Vec

Vector for Document

Titanic.txt [0.23, 0.45, …… 0.71]
Room.txt [0.13, 0.50, …… 0.12]
Sangredus.txt [0.05, 0.89, …… 0.08]
Umbriel.txt [0.65, 0.15, …… 0.41]
Dumped.txt [0.98, 0.21, …… 0.11]
Nessa.txt [0.42, 0.12, …… 0.81]

Vector for Word

titanic [0.03, 0.89, …… 0.71]
apartment [0.83, 0.50, …… 0.12]
room [0.55, 0.89, …… 0.08]
parrot [0.62, 0.15, …… 0.41]
nessa [0.08, 0.21, …… 0.11]
word [0.42, 0.12, …… 0.81]
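Because documents and words get vectors of the same length, a document can be compared with words (or with other documents) by cosine similarity. This is a minimal sketch: the 3-dimensional numbers are invented, since the slide elides most components, and `nearest` is a hypothetical helper name.

```python
from math import sqrt

# Invented 3-dimensional vectors: none of these numbers come from a trained model.
doc_vectors = {
    "Titanic.txt": [0.9, 0.1, 0.2],
    "Room.txt":    [0.1, 0.8, 0.3],
    "Nessa.txt":   [0.2, 0.2, 0.9],
}
word_vectors = {
    "titanic":   [0.8, 0.2, 0.1],
    "apartment": [0.2, 0.9, 0.2],
    "parrot":    [0.1, 0.3, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def nearest(vector, candidates):
    """Return the candidate name whose vector is most similar to `vector`."""
    return max(candidates, key=lambda name: cosine(vector, candidates[name]))

for name, vector in doc_vectors.items():
    print(name, "->", nearest(vector, word_vectors))
```

Passing `doc_vectors` as the candidates instead of `word_vectors` turns the same helper into a similar-article search.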

Page 22: Natural Language Processing with Python

Doc2Vec

[Figure: diagram relating LDA, Doc2Vec, and Word2Vec]

Page 23: Natural Language Processing with Python

LDA

Page 24: Natural Language Processing with Python

LDA2Vec

Page 25: Natural Language Processing with Python

LDA2Vec

document vector = 0.15*programming + 0.25*football + 0.60*beer
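The mixture on this slide is a weighted sum of topic vectors. A minimal sketch, assuming invented 3-dimensional topic vectors (one axis per topic, to keep the result easy to read) and the weights from the slide:

```python
# Invented topic vectors, one axis per topic, for illustration only.
topics = {
    "programming": [1.0, 0.0, 0.0],
    "football":    [0.0, 1.0, 0.0],
    "beer":        [0.0, 0.0, 1.0],
}

# Topic weights from the slide; they sum to 1.
weights = {"programming": 0.15, "football": 0.25, "beer": 0.60}

dimension = len(next(iter(topics.values())))

# Weighted sum of the topic vectors, one component at a time.
document_vector = [
    sum(weights[name] * topics[name][d] for name in topics)
    for d in range(dimension)
]
print(document_vector)
```

With learned topic vectors the axes would no longer line up with single topics, but the weights would still say how much of each topic the document contains.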

Page 26: Natural Language Processing with Python

Doc2Vec

TIME FOR PYTHON

Page 27: Natural Language Processing with Python

Why Python?

• NLTK

• Gensim

• TextBlob

• Urllib

• Pattern

• Orange

• Sklearn

Page 28: Natural Language Processing with Python

Page 29: Natural Language Processing with Python

Page 30: Natural Language Processing with Python

Data Science Flow

Target formulation

Wikipedia parsing

Text cleaning

Model building

Results analysis

Page 31: Natural Language Processing with Python

Target formulation

Article similarity for Doc2Vec

Topics for LDA2Vec

Page 32: Natural Language Processing with Python

Wikipedia parsing
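The transcript does not show how the articles were downloaded. One possible approach, sketched here as an assumption, uses only the standard library and the MediaWiki API's plain-text extracts. The function names and the `USER_AGENT` string are hypothetical, and `fetch_article` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier; replace it with your own contact details.
USER_AGENT = "nlp-slides-example/0.1"

def build_extract_url(title):
    """Build a MediaWiki API query for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)

def parse_extract(payload):
    """Pull the article text out of the decoded JSON response."""
    pages = payload["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def fetch_article(title):
    """Download and parse one article (requires network access)."""
    request = Request(build_extract_url(title), headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return parse_extract(json.load(response))
```

Calling `fetch_article("Ukraine")` would return the article body as a single string, ready for the cleaning step.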

Page 33: Natural Language Processing with Python

Text cleaning

Tokenization

Digit removal

Stopword removal

Punctuation cleaning

Coding

Stemming

['ukraine', 'ukrainian', 'ukraina', 'country', 'eastern', 'europe', 'bordered', 'russia', 'east', 'northeast', 'belarus', 'northwest', 'poland', 'slovakia', 'west', 'hungary', 'romania', 'moldova', 'southwest', 'black', 'azov', 'south', 'southeast', 'respectively', 'ukraine', 'currently', 'territorial', 'dispute', 'russia', 'crimean', 'peninsula', 'russia', 'annexed', 'ukraine', 'international', 'community', 'recognise', 'ukrainian', 'including', 'crimea', 'ukraine', 'area', 'making', 'largest', 'country', 'entirely', 'within', 'europe', 'largest', 'country', 'world', 'population', 'million', 'making', 'populous', 'country', 'world']
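A token list like the one above can be produced by a short cleaning pipeline. This is a minimal standard-library sketch, not the code used for the talk: the stopword list is a tiny invented sample (NLTK ships a fuller one), `naive_stem` is a rough stand-in for a real stemmer, and the “Coding” step is not covered.

```python
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"is", "a", "in", "the", "of", "by", "to", "and", "it", "with"}

# Translation table that turns every punctuation character into a space.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def naive_stem(token):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def clean(text, stem=False):
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                     # digit removal
    text = text.translate(PUNCTUATION_TABLE)             # punctuation cleaning
    tokens = text.split()                                # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [naive_stem(t) for t in tokens] if stem else tokens

print(clean("Ukraine is a country in Eastern Europe, bordered by Russia to the east."))
```

Stemming is off by default here because the sample output above still contains unstemmed forms such as 'bordered'.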

Page 34: Natural Language Processing with Python

Doc2Vec: LabeledSentence

“Doc_12” “Robot” “Food_cat” LDA vec
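LabeledSentence pairs a list of words with a list of tags; later gensim releases name the same structure TaggedDocument. The sketch below uses a namedtuple stand-in with the same two fields so that it runs without gensim installed. The articles and the extra tags are invented, modelled on the labels shown on the slide.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Invented cleaned articles keyed by document id.
articles = {
    "Doc_12": ["robot", "eat", "cat", "food"],
    "Doc_13": ["hobbit", "walk", "mountain"],
}

# Invented extra labels: a document may carry more than its own id.
extra_tags = {"Doc_12": ["Robot", "Food_cat"]}

corpus = [
    TaggedDocument(words=words, tags=[doc_id] + extra_tags.get(doc_id, []))
    for doc_id, words in articles.items()
]

for doc in corpus:
    print(doc.tags, doc.words)
```

Every tag gets its own vector during training, so a shared tag such as “Robot” ends up representing all documents that carry it.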

Page 35: Natural Language Processing with Python

Doc2Vec

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(size=300, window=10, min_count=10, workers=4, alpha=0.025, min_alpha=0.025)
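Two of these parameters interact: gensim lowers the learning rate linearly from alpha to min_alpha over training, so setting both to 0.025 keeps it constant unless the calling code changes it between passes. The sketch below is a simplified model of that schedule, not gensim's own code, and `learning_rate` is a hypothetical helper. Note also that gensim 4.0 and later name the `size` parameter `vector_size`.

```python
def learning_rate(progress, alpha=0.025, min_alpha=0.025):
    """Linear decay from alpha to min_alpha; progress runs from 0.0 to 1.0."""
    return alpha - (alpha - min_alpha) * progress

# Settings from the slide: start and end rates are equal, so nothing decays.
print([learning_rate(p) for p in (0.0, 0.5, 1.0)])

# Hypothetical alternative with a lower final rate.
print([learning_rate(p, min_alpha=0.0001) for p in (0.0, 0.5, 1.0)])
```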

Page 36: Natural Language Processing with Python

Doc2Vec as Word2Vec

ARTICLE

WORD

MALWARE

Page 37: Natural Language Processing with Python

Page 38: Natural Language Processing with Python

Doc2Vec as Word2Vec

Page 39: Natural Language Processing with Python

Robot

Hobbit

Programmer

Math

Page 40: Natural Language Processing with Python

LDA2Vec: LabeledSentence

“Doc_12” “Robot” “Food_cat” LDA vec

Page 41: Natural Language Processing with Python

LDA

import gensim

lda = gensim.models.ldamodel.LdaModel(modelled_corpus, num_topics=20, update_every=100, passes=20, id2word=dictionary, alpha='auto', eval_every=5)
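The transcript does not show how `modelled_corpus` and `dictionary` were built. LdaModel expects a bag-of-words corpus, that is, one list of (token id, count) pairs per document, plus an id-to-word mapping; in gensim these typically come from `corpora.Dictionary` and its `doc2bow` method. The standard-library sketch below mimics that structure on invented documents; `doc2bow` here is a stand-in function, not the gensim method.

```python
from collections import Counter

# Invented cleaned documents.
texts = [
    ["ukraine", "country", "europe"],
    ["country", "population", "country"],
]

# Stand-in for the dictionary: one integer id per distinct token.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))
id2word = {idx: token for token, idx in token2id.items()}

def doc2bow(text):
    """Sparse bag of words: sorted (token_id, count) pairs."""
    return sorted(Counter(token2id[token] for token in text).items())

corpus = [doc2bow(text) for text in texts]
print(corpus)
```

Word order is discarded at this point, which is why LDA describes what a document is about rather than how it is phrased.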

Page 43: Natural Language Processing with Python
