
A First Look at TF-IDF
Dan Sullivan
Portland Data Science Group
Portland Data Science Meetup
March 2, 2017

What do we do with this?

Challenges
No obvious structure

Fully understanding language is hard

Large number of documents

Want to:
Find documents based on similarity

Classify documents

Fortunately ...
Measuring similarity and

Classifying documents

Does not require fully understanding text

Counting Words

“PDX Data Science is all about data.”

about   all   data   is   pdx   science
  1      1     2      1    1      1
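A minimal sketch of this word counting in Python (the slides show no code here; the tokenization choices are mine):

```python
from collections import Counter
import re

sentence = "PDX Data Science is all about data."

# Lowercase and keep only alphabetic runs as tokens
tokens = re.findall(r"[a-z]+", sentence.lower())

# Term frequency: how many times each word occurs in the document
term_counts = Counter(tokens)
print(term_counts)
# Counter({'data': 2, 'pdx': 1, 'science': 1, 'is': 1, 'all': 1, 'about': 1})
```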

Corpus to Vectors

Words → Count (Term Frequency)

Improvement 1: Remove Stop Words

Words → Count (stop words removed)
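One way to sketch the stop-word step, assuming scikit-learn's built-in English stop-word list (the slide does not name a specific library for this):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["PDX Data Science is all about data."]

# stop_words="english" drops common words such as "is", "all", "about"
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['data' 'pdx' 'science']
print(counts.toarray())                    # [[2 1 1]]
```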

Improvement 2: N-grams

Words → Count
Unigrams: “computer”, “science”
Bigram: “computer science”
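A sketch of keeping two-word phrases as features, using CountVectorizer's ngram_range parameter (my choice of tool; the slide only illustrates the idea):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["computer science is not just computer programming"]

# ngram_range=(1, 2) counts single words and two-word phrases,
# so "computer science" becomes a feature of its own
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# includes 'computer', 'science', 'computer science', 'computer programming', ...
```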

Example: Corpus of Machine Learning Papers

Some terms appear frequently

“Feature”

“Algorithm”

“Training”

Some appear less frequently:
“Reinforcement”

“Non-linear”

“Convolution”


Intuition
Combinations of words are good indicators of a document's topic

Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”

Text mining: “corpus”, “term vector”, “syntax”

Social Network: “graph”, “communities”, “users”, “influence”

Words that appear frequently across documents in a corpus are not good indicators of topic

Words that appear frequently only within documents about a single topic are good indicators of topic


Formalizing Intuition: TF-IDF
Notation:

t - a term

d - a document

D - a set of documents, or corpus

N - number of documents in the corpus

TF - term frequency: tf(t,d) is the number of times a term t occurs in document d

IDF - inverse document frequency: idf(t,D) = log(N / | {d in D: t in d} | )

TF-IDF = tf(t,d) * idf(t,D)
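A direct translation of these definitions into Python, as a sketch (library implementations such as scikit-learn add smoothing, so their weights differ slightly):

```python
import math

def tf(term, doc):
    # Number of times `term` occurs in the tokenized document `doc`
    return doc.count(term)

def idf(term, corpus):
    # idf(t, D) = log(N / |{d in D : t in d}|)
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / n_containing)

def tf_idf(term, doc, corpus):
    # TF-IDF = tf(t, d) * idf(t, D)
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are the best pets".split(),
]
print(tf_idf("cat", corpus[0], corpus))  # 1 * log(3/1) ~ 1.10
print(tf_idf("sat", corpus[0], corpus))  # 1 * log(3/2) ~ 0.41
print(tf_idf("the", corpus[0], corpus))  # "the" is in all 3 docs, so idf = 0
```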

TF-IDF is:

Large when there is a large count of a term in a document (large TF) and a low number of documents in the corpus contain the term

Small when the term appears in many documents in the corpus

[Chart: TF-IDF plotted against term frequency, with stop words, common words, and rare words marked along the frequency axis]

Improvement 3: TF-IDF

Words → TF-IDF

Populating a Term Vector with TF-IDF
V[index(“emmy”)] = tf-idf(“emmy”,d)

V[index(“noether”)] = tf-idf(“noether”,d)

V[index(“known”)] = tf-idf(“known”,d)

V[index(“landmark”)] = tf-idf(“landmark”,d)

V[index(“contribution”)] = tf-idf(“contribution”,d)

V[index(“abstract”)] = tf-idf(“abstract”,d)

V[index(“algebra”)] = tf-idf(“algebra”,d)

V[index(“theoretical”)] = tf-idf(“theoretical”,d)

V[index(“physics”)] = tf-idf(“physics”,d)

“Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
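A hedged sketch of building such a vector with scikit-learn's TfidfVectorizer (the extra documents are invented so the IDF has something to compare against; scikit-learn also smooths IDF, so its weights differ slightly from the raw formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Emmy Noether is known for her landmark contributions "
    "to abstract algebra and theoretical physics",
    "Abstract algebra studies groups, rings, and fields",
    "Theoretical physics relies on mathematical models",
]

# Each row of the resulting matrix is one document's TF-IDF term vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```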

Vector Space Model

[Figure: Doc 1, Doc 2, and Doc 3 plotted as points in a three-dimensional space with axes Term 1, Term 2, Term 3]

        Term 1   Term 2   Term 3
Doc 1    0.4      0.1      0.6
Doc 2    0.3      0.5      0.0
Doc 3    0.0      0.2      0.6
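The same table as a NumPy document-term matrix, purely as an illustrative sketch:

```python
import numpy as np

# Rows are documents, columns are the TF-IDF weights of Term 1..3
doc_term_matrix = np.array([
    [0.4, 0.1, 0.6],   # Doc 1
    [0.3, 0.5, 0.0],   # Doc 2
    [0.0, 0.2, 0.6],   # Doc 3
])
print(doc_term_matrix.shape)  # (3, 3): 3 documents, 3 terms
```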

Similarity Measures

[Figure: Doc 1, Doc 2, and Doc 3 as vectors in term space; similarity measured by Euclidean distance between points or by the cosine of the angle between vectors]
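A sketch computing both measures for the three document vectors above, using scikit-learn's pairwise helpers:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

doc_term_matrix = np.array([
    [0.4, 0.1, 0.6],   # Doc 1
    [0.3, 0.5, 0.0],   # Doc 2
    [0.0, 0.2, 0.6],   # Doc 3
])

# Cosine similarity: 1.0 means two vectors point in the same direction
print(cosine_similarity(doc_term_matrix).round(2))

# Euclidean distance: 0.0 means two points coincide
print(euclidean_distances(doc_term_matrix).round(2))
```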

Classify by Vector (Point)

TF-IDF Vector

Text Classifier with Scikit Learn
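The classifier demo itself is not captured in this transcript; a minimal sketch of the usual TF-IDF plus linear-model pipeline, with invented toy data, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: documents labeled by topic
train_docs = [
    "radar and image sensors for the self-driving automobile",
    "drivers rely on cameras and sensors in autonomous cars",
    "term vectors and corpus statistics for text mining",
    "syntax parsing and corpus analysis for mining documents",
]
train_labels = ["cars", "cars", "text", "text"]

# TF-IDF features feeding a logistic regression classifier
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["a new corpus for mining scientific text"]))
# expected: ['text'] on this toy data
```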

Document Similarity with Gensim
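The Gensim demo is likewise not in the transcript; a minimal sketch using Gensim's Dictionary, TfidfModel, and MatrixSimilarity (standard APIs, invented example documents):

```python
from gensim import corpora, models, similarities

documents = [
    "graph communities and user influence in social networks",
    "term vectors and corpus statistics for text mining",
    "radar image sensors for self driving cars",
]
texts = [doc.lower().split() for doc in documents]

# Map tokens to integer ids, then build bag-of-words and TF-IDF representations
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(bow_corpus)

# Index the TF-IDF vectors for cosine-similarity queries
index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                      num_features=len(dictionary))

query = "mining a text corpus".lower().split()
sims = index[tfidf[dictionary.doc2bow(query)]]
print(sims)  # similarity of the query to each of the three documents
```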

NLP Tools
Python:

Gensim

NLTK

spaCy & textacy

Scikit-Learn

TextBlob

R:
tm

OpenNLP (R interface)

TidyText

Other:
Mallet

Google Natural Language API

Q & A