A First Look at TF-IDF: PDX Data Science Meetup
TRANSCRIPT
A First Look at TF-IDF
Dan Sullivan
Portland Data Science Group
March 2, 2017
Challenges
No obvious structure
Fully understanding language is hard
Large number of documents

Want to:
Find documents based on similarity
Classify documents
Fortunately ...
Measuring similarity and classifying documents does not require fully understanding the text.
Example: Corpus of Machine Learning Papers
Some terms appear frequently:
“Feature”
“Algorithm”
“Training”
Some appear less frequently:
“Reinforcement”
“Non-linear”
“Convolution”
Intuition
Combinations of words are good indicators of the topic of a document:
Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
Text mining: “corpus”, “term vector”, “syntax”
Social networks: “graph”, “communities”, “users”, “influence”
Words that appear frequently across documents in a corpus are not good indicators of topic.
Words that appear frequently only within documents about a single topic are good indicators of topic.
Formalizing Intuition: TF-IDF
Notation:
t - a term
d - a document
D - a set of documents, or corpus
N - number of documents in the corpus
TF - term frequency: tf(t,d) is the number of times term t occurs in document d
IDF - inverse document frequency: idf(t,D) = log(N / |{d in D : t in d}|)
TF-IDF: tf-idf(t,d,D) = tf(t,d) * idf(t,D)
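A minimal sketch of these definitions in Python (the toy corpus and the use of natural log are my choices for illustration; the log base only rescales scores):

```python
import math

def tf(term, doc):
    """Number of times term occurs in doc (doc is a list of tokens)."""
    return doc.count(term)

def idf(term, corpus):
    """log(N / number of documents containing term).
    Assumes term occurs in at least one document."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

# "the" appears in 2 of 3 documents, "cat" in only 1,
# so "cat" scores higher in the first document despite lower TF.
print(tf_idf("the", corpus[0], corpus))  # 2 * log(3/2) ≈ 0.81
print(tf_idf("cat", corpus[0], corpus))  # 1 * log(3/1) ≈ 1.10
```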
TF-IDF is:
Large when a term occurs many times in a document (large TF) and few documents in the corpus contain the term (large IDF).
Small when the term appears in many documents in the corpus.

[Figure: TF-IDF plotted against term frequency, with regions labeled Stop Words, Common Words, and Rare Words]
Populating a Term Vector with TF-IDF
V[index(“emmy”)] = tf-idf(“emmy”,d)
V[index(“noether”)] = tf-idf(“noether”,d)
V[index(“known”)] = tf-idf(“known”,d)
V[index(“landmark”)] = tf-idf(“landmark”,d)
V[index(“contribution”)] = tf-idf(“contribution”,d)
V[index(“abstract”)] = tf-idf(“abstract”,d)
V[index(“algebra”)] = tf-idf(“algebra”,d)
V[index(“theoretical”)] = tf-idf(“theoretical”,d)
V[index(“physics”)] = tf-idf(“physics”,d)
“Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
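A sketch of this vector population, assuming a simple whitespace tokenizer and using a raw count as a stand-in for the tf-idf score (real IDF values would require a full corpus; note the sentence's surface form “contributions” is used here, where the slide shows the stemmed “contribution”):

```python
# Fixed vocabulary positions; each slot holds the term's score.
vocab = ["emmy", "noether", "known", "landmark", "contributions",
         "abstract", "algebra", "theoretical", "physics"]
index = {term: i for i, term in enumerate(vocab)}

doc = ("emmy noether is known for her landmark contributions "
       "to abstract algebra and theoretical physics").lower().split()

V = [0.0] * len(vocab)
for term in doc:
    if term in index:              # ignore out-of-vocabulary words
        V[index[term]] += 1.0      # in practice: tf-idf(term, d, D)

print(V)  # one non-zero entry per vocabulary term in the sentence
```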
Vector Space Model

[Figure: documents Doc 1, Doc 2, Doc 3 plotted as vectors in a three-dimensional space with axes Term 1, Term 2, Term 3]

        Term 1  Term 2  Term 3
Doc 1     0.4     0.1     0.6
Doc 2     0.3     0.5     0.0
Doc 3     0.0     0.2     0.6
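In a vector space model, document similarity is typically measured as the cosine of the angle between term vectors. A minimal sketch using the table's rows as vectors (cosine similarity is my choice here; the slide does not name a specific measure):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows of the term-document table
doc1 = [0.4, 0.1, 0.6]
doc2 = [0.3, 0.5, 0.0]
doc3 = [0.0, 0.2, 0.6]

print(cosine_similarity(doc1, doc3))  # ≈ 0.83: similar direction
print(cosine_similarity(doc1, doc2))  # ≈ 0.40: less similar
```

So Doc 1 is closer in topic to Doc 3 than to Doc 2, even though Doc 1 and Doc 3 share no weight on Term 1.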
NLP Tools
Python:
Gensim
NLTK
spaCy & textacy
scikit-learn
TextBlob
R:
tm
OpenNLP (R interface)
tidytext
Other:
Mallet
Google Natural Language API