Text Analysis - Duke University
TRANSCRIPT
Text Analysis
Everything Data, CompSci 216, Spring 2019
Outline
• Basic Text Processing
  – Word Counts
  – Tokenization
    • Pointwise Mutual Information
  – Normalization
    • Stemming
Outline
• Basic Text Processing
• Finding Salient Tokens
  – TFIDF scoring
Outline
• Basic Text Processing
• Finding Salient Tokens
• Document Similarity & Keyword Search
  – Vector Space Model & Cosine Similarity
  – Inverted Indexes
Outline
• Basic Text Processing
  – Word Counts
  – Tokenization
    • Pointwise Mutual Information
  – Normalization
    • Stemming
• Finding Salient Tokens
• Document Similarity & Keyword Search
Basic Query: Word Counts
Google Flu
Emotional Trends
http://www.ccs.neu.edu/home/amislove/twittermood/
Basic Query: Word Counts
• How many times does each word appear in the document?
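In Python, this basic query is a few lines. A minimal sketch (naive whitespace tokenization; the tokenization subtleties come next):

```python
from collections import Counter

# Minimal word-count sketch: lowercase the text, split on whitespace, tally.
# A real pipeline would use a proper tokenizer (see the following slides).
def word_counts(text):
    return Counter(text.lower().split())

doc = "the quick brown fox jumps over the lazy dog the fox"
counts = word_counts(doc)
# counts["the"] == 3, counts["fox"] == 2
```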
Problem 1
• What is a word?
  – I'm … 1 word or 2 words: {I'm} or {I, am}?
  – State-of-the-art … 1 word or 4 words?
  – San Francisco … 1 or 2 words?
  – Other languages:
    • French: l'ensemble
    • German: Freundschaftsbezeigungen
    • Chinese: 這是一個簡單的句子 ("This is a simple sentence"; no spaces between words)
Solution: Tokenization
• For English:
  – Splitting the text on non-alphanumeric characters is a reasonable way to find individual words.
  – But this will split multiword terms like "San Francisco" and "state-of-the-art".
• For other languages:
  – More sophisticated algorithms are needed to identify word boundaries.
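A sketch of the English heuristic above, splitting on runs of non-alphanumeric characters (note how it breaks apart "San Francisco" and "state-of-the-art"):

```python
import re

# Split on runs of non-alphanumeric characters; drop empty strings that
# appear at the edges. This oversplits "San Francisco" and "state-of-the-art".
def tokenize(text):
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

tokens = tokenize("State-of-the-art systems in San Francisco.")
# -> ['state', 'of', 'the', 'art', 'systems', 'in', 'san', 'francisco']
```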
Finding simple phrases
• Want to find pairs of tokens that always occur together (co-occurrence)
• Intuition: two tokens co-occur if they appear together more often than "random"
  – More often than monkeys typing English words
http://tmblr.co/ZiEAFywGEckV
Formalizing “more often than random”
• Language Model
  – Assigns a probability P(x1, x2, …, xk) to any sequence of tokens.
  – More common sequences have a higher probability.
  – Sequences of length 1: unigrams
  – Sequences of length 2: bigrams
  – Sequences of length 3: trigrams
  – Sequences of length n: n-grams
Formalizing “more often than random”
• Suppose we have a language model
– P(“San Francisco”) is the probability that “San” and “Francisco” occur together (and in that order) in the language.
Formalizing “random”: Bag of words
• Suppose we only have access to the unigram language model
  – Think: all unigrams in the language thrown into a bag with counts proportional to their P(x)
  – Monkeys drawing words at random from the bag
  – P("San") × P("Francisco") is the probability that "San Francisco" occurs in the (random) unigram model
Formalizing “more often than random”
• Pointwise Mutual Information:

  PMI(x1, x2) = log( P(x1, x2) / (P(x1) · P(x2)) )

  – Positive PMI suggests the words co-occur
  – Negative PMI suggests the words do not appear together
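A sketch of PMI from raw counts, assuming the count-based probability estimates defined on a later slide (P = count / N); the base of the logarithm is a convention, base 2 here:

```python
import math

# PMI(x1, x2) = log2( P(x1, x2) / (P(x1) * P(x2)) ), with probabilities
# estimated as count / N over a corpus of n_tokens tokens.
def pmi(count_xy, count_x, count_y, n_tokens):
    p_xy = count_xy / n_tokens
    p_x = count_x / n_tokens
    p_y = count_y / n_tokens
    return math.log2(p_xy / (p_x * p_y))

# A pair that appears together far more often than chance gets positive PMI;
# a pair that avoids each other gets negative PMI.
```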
What is P(x)?
• “Suppose we have a language model …”
• Idea: Use counts from a large corpus of text to compute the probabilities
What is P(x)?
• Unigram: P(x) = count(x) / N
  – count(x) is the number of times token x appears
  – N is the total number of tokens
• Bigram: P(x1, x2) = count(x1, x2) / N
  – count(x1, x2) is the number of times the sequence (x1, x2) appears
• Trigram: P(x1, x2, x3) = count(x1, x2, x3) / N
  – count(x1, x2, x3) is the number of times the sequence (x1, x2, x3) appears
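The count-based estimates above can be sketched in a few lines (following the slide's convention of dividing every n-gram count by N, the total number of tokens):

```python
from collections import Counter

# Estimate n-gram probabilities as count / N from a pre-tokenized list.
def ngram_probs(tokens, n):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(tokens)  # N = total number of tokens, as on the slide
    return {g: c / total for g, c in Counter(grams).items()}

tokens = ["san", "francisco", "is", "in", "california"]
unigrams = ngram_probs(tokens, 1)  # unigrams[("san",)] == 1/5
bigrams = ngram_probs(tokens, 2)   # bigrams[("san", "francisco")] == 1/5
```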
Large text corpora
• Corpus of Contemporary American English (COCA): http://corpus.byu.edu/coca/
• Google N-gram viewer: https://books.google.com/ngrams
Summary of Problem 1
• Word tokenization can be hard
• Splitting on spaces or non-alphanumeric characters may oversplit the text
• We can find co-occurrent tokens:
  – Build a language model from a large corpus
  – Check whether the pair of tokens appears more often than random using pointwise mutual information
Language models …
• … have many, many applications
  – Tokenizing long strings
  – Word/query completion and suggestion
  – Spell checking
  – Machine translation (NMT)
  – Question answering (SQuAD)
  – Natural language inference (MultiNLI)
  – Conversational question answering (CoQA)
  – …
https://blog.openai.com/better-language-models/
(Google)
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Problem 2
• A word may be represented in many forms
  – car, cars, car's, cars' → car
  – automation, automatic → automate
• Lemmatization: the problem of finding the correct dictionary headword form
Solution: Stemming
• Words are made up of
  – Stems: the core word
  – Affixes: modifiers added (often with grammatical function)
• Stemming: reduce terms to their stems by crudely chopping off affixes
  – automation, automatic → automat
Porter’s algorithm for English
• Sequences of rules applied to words
Example rule sets:
/(.*)sses$/ → \1ss
/(.*)ies$/ → \1i
/(.*)ss$/ → \1ss
/(.*[^s])s$/ → \1

/(.*[aeiou]+.*)ing$/ → \1
/(.*[aeiou]+.*)ed$/ → \1
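The rule sets above translate directly into regex substitutions. A sketch covering only the fragment shown on the slide, not the full Porter algorithm (within a set, the first matching rule fires):

```python
import re

# The slide's rule fragments as (pattern, replacement) pairs.
PLURAL_RULES = [
    (r"(.*)sses$", r"\1ss"),
    (r"(.*)ies$", r"\1i"),
    (r"(.*)ss$", r"\1ss"),
    (r"(.*[^s])s$", r"\1"),
]
VERB_RULES = [
    (r"(.*[aeiou]+.*)ing$", r"\1"),
    (r"(.*[aeiou]+.*)ed$", r"\1"),
]

def apply_rules(word, rules):
    # Apply the first rule whose pattern matches; otherwise leave word as-is.
    for pattern, repl in rules:
        new, n = re.subn(pattern, repl, word)
        if n:
            return new
    return word

# apply_rules("caresses", PLURAL_RULES) -> "caress"
# apply_rules("walking", VERB_RULES) -> "walk"
```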
Any other problems?
• The same word can mean different things
  – Florence the person vs. Florence the city
  – Paris Hilton (person or hotel?)
• Abbreviations
  – I.B.M. vs. International Business Machines
• Different words meaning the same thing
  – Big Apple vs. New York
• …
Any other problems?
• The same word meaning different things
  – Word Sense Disambiguation
• Abbreviations
  – Translations
• Different words meaning the same thing
  – Named Entity Recognition & Entity Resolution
• …
http://slides.com/simonescardapane/the-dark-side-of-deep-learning
Word2Vec
GloVe
Outline
• Basic Text Processing
• Finding Salient Tokens
  – TFIDF scoring
• Document Similarity & Keyword Search
Summarizing text
Finding salient tokens (words)
• Most frequent tokens?
Document Frequency
• Intuition: Uninformative words appear in many documents (not just the one we are concerned about)
• Salient word:
  – High count within the document
  – Low count across documents
TF·IDF score
• Term Frequency (TF):

  TF(x) = log10( 1 + c(x) )

  where c(x) is the number of times x appears in the document.

• Inverse Document Frequency (IDF):

  IDF(x) = log10( N / DF(x) )

  where N is the total number of documents and DF(x) is the number of documents x appears in.
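A sketch of TF·IDF scoring under a common log-based variant (assumed here): TF(x) = log10(1 + c(x)) and IDF(x) = log10(N / DF(x)); the exact variant differs between textbooks.

```python
import math

# corpus is a list of token lists, one per document.
def tfidf(term, doc_tokens, corpus):
    tf = math.log10(1 + doc_tokens.count(term))        # term frequency
    df = sum(1 for d in corpus if term in d)           # document frequency
    idf = math.log10(len(corpus) / df) if df else 0.0  # inverse doc frequency
    return tf * idf

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "cat" appears in 2 of 3 documents, so it gets a positive score.
```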
Back to summarization
• Simple heuristic:
  – Pick the sentences S = {x1, x2, …, xk} with the highest

    Salience(S) = (1 / |S|) Σ_{x ∈ S} TF(x) · IDF(x)
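The heuristic scores a sentence by the average TF·IDF of its tokens. A sketch, where tfidf_score is a hypothetical precomputed mapping from token to its TF·IDF value:

```python
# Average TF-IDF of the tokens in a sentence; higher means more salient.
def salience(sentence_tokens, tfidf_score):
    if not sentence_tokens:
        return 0.0
    return sum(tfidf_score[x] for x in sentence_tokens) / len(sentence_tokens)

# Hypothetical precomputed scores, for illustration only:
tfidf_score = {"caesar": 2.0, "the": 0.0, "rome": 1.0}
# salience(["caesar", "the", "rome"], tfidf_score) -> 1.0
```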
Outline
• Basic Text Processing
• Finding Salient Tokens
• Document Similarity & Keyword Search
  – Vector Space Model & Cosine Similarity
  – Inverted Indexes
Document Similarity
Vector Space Model
• Let V = {x1, x2, …, xn} be the set of all tokens (across all documents)
• A document is an n-dimensional vector
D = [w1, w2, …, wn]
where wi = TFIDF(xi, D)
Distance between documents
• Euclidean distance between
  D1 = [w1, w2, …, wn]
  D2 = [y1, y2, …, yn]

  d(D1, D2) = sqrt( Σ_i (wi − yi)^2 )

• Why?
Cosine Similarity
(figure: angle θ between document vectors D1 and D2)
Cosine Similarity
• D1 = [w1, w2, …, wn]
• D2 = [y1, y2, …, yn]

  cos(D1, D2) = ( Σ_i wi · yi ) / ( sqrt(Σ_i wi^2) · sqrt(Σ_i yi^2) )

  – Numerator: dot product
  – Denominator: product of L2 norms
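A sketch of cosine similarity over two dense TF-IDF vectors: the dot product divided by the product of the L2 norms.

```python
import math

# cos(D1, D2) = dot(D1, D2) / (||D1|| * ||D2||); 1 means same direction,
# 0 means orthogonal (no shared terms).
def cosine_similarity(d1, d2):
    dot = sum(w * y for w, y in zip(d1, d2))
    norm1 = math.sqrt(sum(w * w for w in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# cosine_similarity([1, 0], [0, 1]) -> 0.0
# cosine_similarity([1, 2, 3], [2, 4, 6]) -> ~1.0 (parallel vectors)
```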
Keyword Search
• How to find documents that are similar to a keyword query?
• Intuition: Think of the query as another (very short) document
Keyword Search
• Simple Algorithm
For every document D in the corpus:
    Compute d(q, D)
Return the top-k highest-scoring documents
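The simple algorithm above, sketched with the scoring function left pluggable (e.g. cosine similarity over TF-IDF vectors):

```python
# Score every document against the query and return the ids of the top-k.
def keyword_search(query_vec, doc_vecs, similarity, k=10):
    scored = [(similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy run with a plain dot product standing in for the similarity function:
def dot(q, d):
    return sum(a * b for a, b in zip(q, d))

docs = [[1, 0], [0, 1], [1, 1]]
top2 = keyword_search([1, 0], docs, dot, k=2)  # documents 0 and 2 match
```

Note that this scans the entire corpus for every query, which is exactly the scaling problem the inverted indexes on the next slides address.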
Does it scale?
http://www.worldwidewebsize.com/
Inverted Indexes
• Let V = {x1, x2, …, xn} be the set of all tokens
• For each token, store <token, sorted list of documents the token appears in>
  – <"caesar", [1, 3, 4, 6, 7, 10, …]>
• How does this help?
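The structure described above can be sketched as a dictionary from token to a sorted list of document ids:

```python
from collections import defaultdict

# token -> sorted list of ids of the documents containing that token.
def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for t in tokens:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = [["caesar", "rome"], ["cleopatra"], ["caesar", "cleopatra"]]
index = build_inverted_index(docs)
# index["caesar"] -> [0, 2]
```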
Using Inverted Lists
• Documents containing "caesar"
  – Use the inverted list to find documents containing "caesar"
  – What additional information should we keep to compute similarity between the query and documents?
Using Inverted Lists
• Documents containing "caesar" AND "cleopatra"
  – Return documents in the intersection of the two inverted lists
  – Why is the inverted list sorted on document id?
• OR? NOT?
  – Union and difference, respectively
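Sorting on document id is what makes AND cheap: the intersection becomes a single linear merge of the two lists rather than repeated lookups. A sketch:

```python
# Intersect two sorted inverted lists with one linear pass.
def intersect(list1, list2):
    i = j = 0
    out = []
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            out.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return out

# intersect([1, 3, 4, 6, 7, 10], [2, 3, 6, 9, 10]) -> [3, 6, 10]
```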
Many other things in a modern search engine …
• Maintain positional information to answer phrase queries
• Scoring is not based only on token similarity
  – Importance of webpages: PageRank (in later classes)
• User feedback
  – Clicks and query reformulations
Summary
• Word counts are very useful when analyzing text
  – Need good algorithms for tokenization, stemming, and other normalizations
• Algorithms for finding word co-occurrence
  – Language models
  – Pointwise mutual information
Summary (contd.)
• Raw counts are not sufficient to find salient tokens in a document
  – Term Frequency × Inverse Document Frequency (TFIDF) scoring
• Keyword Search
  – Use cosine similarity over TFIDF scores to compute similarity
  – Use inverted indexes to speed up processing