text analysis with machine learning

43
Dato Confidential Text Analysis with Machine Learning 1 Chris DuBois Dato, Inc.

Upload: turi-inc

Post on 15-Apr-2017

403 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Text Analysis with Machine Learning

Dato Confidential

Text Analysis with Machine Learning

1

• Chris DuBois• Dato, Inc.

Page 2: Text Analysis with Machine Learning

Dato Confidential

About meChris DuBois

Staff Data ScientistPh.D. in Statistics

Previously: probabilistic models of social networks and event data occurring over time.

Recently: tools for recommender systems, text analysis, feature engineering and model parameter search.2

[email protected] @chrisdubois

Page 3: Text Analysis with Machine Learning

Dato Confidential

About Dato

3

We aim to help developers…• develop valuable ML-based applications• deploy models as intelligent microservices

Page 4: Text Analysis with Machine Learning

Dato Confidential

Quick poll!

4

Page 5: Text Analysis with Machine Learning

Dato Confidential

Agenda• Applications of text analysis: examples and a

demo• Fundamentals• Text processing• Machine learning• Task-oriented tools

• Putting it all together• Deploying the pipeline for the application

• Quick survey of next steps• Advanced modeling techniques

5

Page 6: Text Analysis with Machine Learning

Dato Confidential

Applications of text analysis

6

Page 7: Text Analysis with Machine Learning

Dato Confidential

Product reviews and comments

7

Applications of text analysis

Page 8: Text Analysis with Machine Learning

Dato Confidential

User forums and customer service chats

8

Applications of text analysis

Page 9: Text Analysis with Machine Learning

Dato Confidential

Social media: blogs, tweets, etc.

9

Applications of text analysis

Page 10: Text Analysis with Machine Learning

Dato Confidential

News feeds

10

Applications of text analysis

Page 11: Text Analysis with Machine Learning

Dato Confidential

Speech recognition and chat bots

11

Applications of text analysis

Page 12: Text Analysis with Machine Learning

Dato Confidential

Aspect mining

12

Applications of text analysis

Page 13: Text Analysis with Machine Learning

Dato Confidential

Quick poll!

13

Page 14: Text Analysis with Machine Learning

Dato Confidential

Example of a product sentiment application

14

Page 15: Text Analysis with Machine Learning

Dato Confidential

Text processing fundamentals

15

Page 16: Text Analysis with Machine Learning

Dato Confidential16

Data

id timestamp user_name text102 2016-04-05

9:50:31Bob The food tasted

bland and uninspired.

103 2016-04-06 3:20:25

Charles I would absolutely bring my friends here. I could eat those fries every …

104 2016-04-08 11:52 Alicia Too expensive for me. Even the water was too expensive.

… … … …

int datetime str str

Page 17: Text Analysis with Machine Learning

Dato Confidential17

Transforming string columns

id text102 The food tasted bland and

uninspired.

103 I would absolutely bring my friends here. I could eat those fries every …

104 Too expensive for me. Even the water was too expensive.

… …

int str

TokenizerCountWordsNGramCounterTFIDFRemoveRareWordsSentenceSplitterExtractPartsOfSpeech

stack/unstackPipelines

Page 18: Text Analysis with Machine Learning

Dato Confidential18

Tokenization

Convert each document into a list of tokens.

“The caesar salad was amazing.”INPUT

[“The”, “caesar”, “salad”, “was”, “amazing.”]OUTPUT

Page 19: Text Analysis with Machine Learning

Dato Confidential19

Bag of words representation

Compute the number of times each word occurs.

“The ice cream was the best.”

{ “the”: 2, “ice”: 1, “cream”: 1, “was”: 1, “best”: 1}

INPUT

OUTPUT

Page 20: Text Analysis with Machine Learning

Dato Confidential20

N-Grams (words)

Compute the number of times each set of words occur.

“Give it away, give it away, give it away now”{“give it”:

3, “it away”:

3, “away give”:

2, “away now”: 1

}

INPUT

OUTPUT

Page 21: Text Analysis with Machine Learning

Dato Confidential21

N-Grams (characters)

Compute the number of times each set of characters occur.

“mississippi”{“mis”:

1, “iss”:

2, “sis”:

1, “ssi”:

2, “sip”:

1,...

}

INPUT

OUTPUT

Page 22: Text Analysis with Machine Learning

Dato Confidential22

TF-IDF representation

Rescale word counts to discriminate between common and distinctive words.

{“this”: 1, “is”: 1, “a”: 2, “example”: 3}

{“this”: .3, “is”: .1, “a”: .2, “example”: .9}

INPUT

OUTPUT

Low scores for words that are common among all documentsHigh scores for rare words occurring often in a document

Page 23: Text Analysis with Machine Learning

Dato Confidential23

Remove rare words

Remove rare words (or pre-defined stopwords).

INPUT

OUTPUT

“I like green eggs3 and ham.”

“I like green and ham.”

Page 24: Text Analysis with Machine Learning

Dato Confidential24

Split by sentence

Convert each document into a list of sentences.

INPUT

OUTPUT

“It was delicious. I will be back.”

[“It was delicious.”, “I will be back.”]

Page 25: Text Analysis with Machine Learning

Dato Confidential25

Extract parts of speech

Identify particular parts of speech.

INPUT

OUTPUT

“It was delicious. I will go back.”

[“delicious”]

Page 26: Text Analysis with Machine Learning

Dato Confidential26

stack

Rearrange the data set to “stack” the list column.

id text102 [“The food was great.”, “I

will be back.”]

103 [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]

104 [“Too expensive for me.”]

… …int list

id text102 “The food was great.”

102 “I will be back.”

103 “I would absolutely bring my friends here.”

… …int str

Page 27: Text Analysis with Machine Learning

Dato Confidential27

unstack

Rearrange the data set to “unstack” the string column.

id text102 [“The food was great.”, “I

will be back.”]

103 [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]

104 [“Too expensive for me.”]

… …int list

id text102 “The food was great.”

102 “I will be back.”

103 “I would absolutely bring my friends here.”

… …int str

Page 28: Text Analysis with Machine Learning

Dato Confidential28

Pipelines

Combine multiple transformations to create a single pipeline.

from graphlab.feature_engineering import WordCounter, RareWordTrimmer

f = gl.feature_engineering.create(sf, [WordCounter, RareWordTrimmer])f.fit(data)f.transform(new_data)

Page 29: Text Analysis with Machine Learning

Dato Confidential

Machine learning toolkits

29

Page 30: Text Analysis with Machine Learning

Dato Confidential

matches unstructured text queries to a set of strings

30

Task-oriented toolkits and building blocks

Autotagger =

= Nearest neighbors +NGramCounter

Page 31: Text Analysis with Machine Learning

Dato Confidential31

Nearest neighbors

id TF-IDF of text

int

dict

Reference

NearestNeighborsModel

Queryid TF-IDF of text

int dict

query_id reference_id score rank

int int float int

Result

Page 32: Text Analysis with Machine Learning

Dato Confidential32

Autotagger = Nearest Neighbors + Feature Engineering

id char ngrams

int

dict …

Reference

Result

AutotaggerModel

Queryid char ngrams …

int dict

query_id reference_id score rank

int int float int

Page 33: Text Analysis with Machine Learning

Dato Confidential33

Task-oriented toolkits and building blocks

SentimentAnalysis

predicts positive/negative sentiment in unstructured text

=

WordCounter +Logistic Classifier

=

Page 34: Text Analysis with Machine Learning

Dato Confidential34

Logistic classifier

label

feature(s)

int/str

int/float/str/dict/list

Training data

LogisticClassifier

label feature(s)

int/str int/float/str/dict/list

New data

scores

float

Probability label=1

Page 35: Text Analysis with Machine Learning

Dato Confidential35

Sentiment analysis = LogisticClassifier + Feature Engineering

label

bag of words

int/str

dict

Training data

SentimentAnalysis

label bag of words

int/str dict

New data

scores

float

Sentiment scores(probability label = 1)

Page 36: Text Analysis with Machine Learning

Dato Confidential36

Product sentiment toolkit

>>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence')>>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow'])

+---------+----------------+-----------------+--------------+| keyword | sd_sentiment | mean_sentiment | review_count |+---------+----------------+-----------------+--------------+| cable | 0.302471264675 | 0.285512408978 | 1618 || slow | 0.282117103769 | 0.243490314737 | 369 || cost | 0.283310577512 | 0.197087019219 | 291 || charges | 0.164350792173 | 0.0853637431588 | 1412 || late | 0.119163914305 | 0.0712757752753 | 2202 || billing | 0.159655783707 | 0.0697454360245 | 583 |+---------+----------------+-----------------+--------------+

Page 37: Text Analysis with Machine Learning

Dato Confidential

Code for deploying the demo

37

Page 38: Text Analysis with Machine Learning

Dato Confidential

Advanced topics in modeling text data

38

Page 39: Text Analysis with Machine Learning

Dato Confidential39

Topic models

Cluster words according to their co-occurrence in documents.

Page 40: Text Analysis with Machine Learning

Dato Confidential40

Word2vec

Embeddings where nearby words are used in similar contexts.

Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html

Page 41: Text Analysis with Machine Learning

Dato Confidential41

LSTM (Long-Short Term Memory)

Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Train a neural network to predict the next word given previous words.

Page 42: Text Analysis with Machine Learning

Dato Confidential

Conclusion• Applications of text analysis: examples and a

demo• Fundamentals• Text processing• Machine learning• Task-oriented tools

• Putting it all together• Deploying the pipeline for the application

• Quick survey of next steps• Advanced modeling techniques

42

Page 43: Text Analysis with Machine Learning

Dato Confidential

Next steps

Webinar material? Watch for an upcoming email!Help getting started? Userguide, API docs, ForumMore webinars? Benchmarking ML, data mining, and more

Contact: Chris DuBois, [email protected]

43