
A First Look at TF-IDF
Dan Sullivan
Portland Data Science Group
Portland Data Science Meetup
March 2, 2017

What do we do with this?

Challenges
No obvious structure

Fully understanding language is hard

Large number of documents

Want to:
Find documents based on similarity

Classify documents

Fortunately ...
Measuring similarity and

Classifying documents

Does not require fully understanding text

Counting Words

“PDX Data Science is all about data.”

about   all   data   is   pdx   science
  1      1     2      1    1      1
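A minimal sketch of this word counting in Python (the slides show no code here; the tokenization choices are mine):

```python
from collections import Counter
import re

sentence = "PDX Data Science is all about data."

# Lowercase and keep only alphabetic runs as tokens
tokens = re.findall(r"[a-z]+", sentence.lower())

# Term frequency: how many times each word occurs in the document
term_counts = Counter(tokens)
print(term_counts)
# Counter({'data': 2, 'pdx': 1, 'science': 1, 'is': 1, 'all': 1, 'about': 1})
```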

Corpus to Vectors

Words → Count (Term Frequency)

Improvement 1: Remove Stop Words

Words → Count (stop words removed)
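One way to sketch the stop-word step, assuming scikit-learn's built-in English stop-word list (the slide does not name a specific library for this):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["PDX Data Science is all about data."]

# stop_words="english" drops common words such as "is", "all", "about"
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['data' 'pdx' 'science']
print(counts.toarray())                    # [[2 1 1]]
```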

Improvement 2: N-grams

Words → Count
Unigrams: “computer”, “science”
Bigram: “computer science”
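A sketch of keeping two-word phrases as features, using CountVectorizer's ngram_range parameter (my choice of tool; the slide only illustrates the idea):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["computer science is not just computer programming"]

# ngram_range=(1, 2) counts single words and two-word phrases,
# so "computer science" becomes a feature of its own
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# includes 'computer', 'science', 'computer science', 'computer programming', ...
```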

Example: Corpus of Machine Learning Papers

Some terms appear frequently

“Feature”

“Algorithm”

“Training”

Some appear less frequently:
“Reinforcement”

“Non-linear”

“Convolution”


Intuition
Combinations of words are good indicators of a document's topic

Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”

Text mining: “corpus”, “term vector”, “syntax”

Social Network: “graph”, “communities”, “users”, “influence”

Words that appear frequently across documents in a corpus are not good indicators of topic

Words that appear frequently only within documents about a single topic are good indicators of topic


Formalizing Intuition: TF-IDF
Notation:

t - a term

d - a document

D - a set of documents, or corpus

N - number of documents in the corpus

TF - term frequency: tf(t,d) is the number of times a term t occurs in document d

IDF - inverse document frequency: idf(t,D) = log(N / | {d in D: t in d} | )

TF-IDF = tf(t,d) * idf(t,D)
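A direct translation of these definitions into Python, as a sketch (library implementations such as scikit-learn add smoothing, so their weights differ slightly):

```python
import math

def tf(term, doc):
    # Number of times `term` occurs in the tokenized document `doc`
    return doc.count(term)

def idf(term, corpus):
    # idf(t, D) = log(N / |{d in D : t in d}|)
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / n_containing)

def tf_idf(term, doc, corpus):
    # TF-IDF = tf(t, d) * idf(t, D)
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are the best pets".split(),
]
print(tf_idf("cat", corpus[0], corpus))  # 1 * log(3/1) ~ 1.10
print(tf_idf("sat", corpus[0], corpus))  # 1 * log(3/2) ~ 0.41
print(tf_idf("the", corpus[0], corpus))  # "the" is in all 3 docs, so idf = 0
```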

TF-IDF is:

Large when there is a large count of a term in a document (large TF) and a low number of documents in the corpus contain the term

Small when the term appears in many documents in the corpus

[Chart: TF-IDF plotted against term frequency, with stop words, common words, and rare words marked along the frequency axis]

Improvement 3: TF-IDF

Words → TF-IDF

Populating a Term Vector with TF-IDF
V[index(“emmy”)] = tf-idf(“emmy”,d)

V[index(“noether”)] = tf-idf(“noether”,d)

V[index(“known”)] = tf-idf(“known”,d)

V[index(“landmark”)] = tf-idf(“landmark”,d)

V[index(“contribution”)] = tf-idf(“contribution”,d)

V[index(“abstract”)] = tf-idf(“abstract”,d)

V[index(“algebra”)] = tf-idf(“algebra”,d)

V[index(“theoretical”)] = tf-idf(“theoretical”,d)

V[index(“physics”)] = tf-idf(“physics”,d)

“Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
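A hedged sketch of building such a vector with scikit-learn's TfidfVectorizer (the extra documents are invented so the IDF has something to compare against; scikit-learn also smooths IDF, so its weights differ slightly from the raw formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Emmy Noether is known for her landmark contributions "
    "to abstract algebra and theoretical physics",
    "Abstract algebra studies groups, rings, and fields",
    "Theoretical physics relies on mathematical models",
]

# Each row of the resulting matrix is one document's TF-IDF term vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```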

Vector Space Model

[Figure: Doc 1, Doc 2, and Doc 3 plotted as points in a three-dimensional space with axes Term 1, Term 2, Term 3]

        Term 1   Term 2   Term 3
Doc 1    0.4      0.1      0.6
Doc 2    0.3      0.5      0.0
Doc 3    0.0      0.2      0.6
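The same table as a NumPy document-term matrix, purely as an illustrative sketch:

```python
import numpy as np

# Rows are documents, columns are the TF-IDF weights of Term 1..3
doc_term_matrix = np.array([
    [0.4, 0.1, 0.6],   # Doc 1
    [0.3, 0.5, 0.0],   # Doc 2
    [0.0, 0.2, 0.6],   # Doc 3
])
print(doc_term_matrix.shape)  # (3, 3): 3 documents, 3 terms
```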

Similarity Measures

[Figure: Doc 1, Doc 2, and Doc 3 as vectors in term space; similarity measured by Euclidean distance between points or by the cosine of the angle between vectors]
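A sketch computing both measures for the three document vectors above, using scikit-learn's pairwise helpers:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

doc_term_matrix = np.array([
    [0.4, 0.1, 0.6],   # Doc 1
    [0.3, 0.5, 0.0],   # Doc 2
    [0.0, 0.2, 0.6],   # Doc 3
])

# Cosine similarity: 1.0 means two vectors point in the same direction
print(cosine_similarity(doc_term_matrix).round(2))

# Euclidean distance: 0.0 means two points coincide
print(euclidean_distances(doc_term_matrix).round(2))
```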

Classify by Vector (Point)

TF-IDF Vector

Text Classifier with Scikit Learn
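The classifier demo itself is not captured in this transcript; a minimal sketch of the usual TF-IDF plus linear-model pipeline, with invented toy data, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: documents labeled by topic
train_docs = [
    "radar and image sensors for the self-driving automobile",
    "drivers rely on cameras and sensors in autonomous cars",
    "term vectors and corpus statistics for text mining",
    "syntax parsing and corpus analysis for mining documents",
]
train_labels = ["cars", "cars", "text", "text"]

# TF-IDF features feeding a logistic regression classifier
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["a new corpus for mining scientific text"]))
# expected: ['text'] on this toy data
```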

Document Similarity with Gensim
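The Gensim demo is likewise not in the transcript; a minimal sketch using Gensim's Dictionary, TfidfModel, and MatrixSimilarity (standard APIs, invented example documents):

```python
from gensim import corpora, models, similarities

documents = [
    "graph communities and user influence in social networks",
    "term vectors and corpus statistics for text mining",
    "radar image sensors for self driving cars",
]
texts = [doc.lower().split() for doc in documents]

# Map tokens to integer ids, then build bag-of-words and TF-IDF representations
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(bow_corpus)

# Index the TF-IDF vectors for cosine-similarity queries
index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                      num_features=len(dictionary))

query = "mining a text corpus".lower().split()
sims = index[tfidf[dictionary.doc2bow(query)]]
print(sims)  # similarity of the query to each of the three documents
```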

NLP Tools
Python:

Gensim

NLTK

spaCy & textacy

Scikit-Learn

TextBlob

R:
tm

OpenNLP (R interface)

TidyText

Other:
Mallet

Google Natural Language API

Q & A