Bag-of-Words Methods for Text Mining | CSCI-GA.2590 – Lecture 2A | Ralph Grishman | NYU
TRANSCRIPT
Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
NYU
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
  – document retrieval
  – opinion mining
  – association mining
• see how far we can get with document-level bag-of-words models
  – and introduce some of our mathematical approaches
1/16/14
• document retrieval
• opinion mining
• association mining
Information Retrieval
• Task: given query = list of keywords, identify and rank relevant documents from collection
• Basic idea: find documents whose set of words most closely matches words in query
Topic Vector
• Suppose the document collection has n distinct words, w1, …, wn
• Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W  = [The, chased, dog, cat, mouse]  (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
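The example above can be sketched in a few lines of Python; the tokenization here (split on whitespace, strip the final period, case-insensitive matching) is one simple choice, not the lecture's:

```python
# Build a topic vector: the i-th component is the frequency of
# vocabulary word w_i in the document (case-insensitive sketch).
def topic_vector(doc, vocab):
    words = doc.strip(".").split()
    return [sum(1 for w in words if w.lower() == v.lower()) for v in vocab]

W = ["The", "chased", "dog", "cat", "mouse"]
V1 = topic_vector("The cat chased the mouse.", W)   # [2, 1, 0, 1, 1]
V2 = topic_vector("The dog chased the cat.", W)     # [2, 1, 1, 1, 0]
```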
Weighting the components
• Unusual words like elephant determine the topic much more than common words such as “the” or “have”
• can ignore words on a stop list, or
• weight each term frequency tf_i by its inverse document frequency idf_i:

  w_i = tf_i × idf_i
  idf_i = log(N / n_i)

• where N = size of collection and n_i = number of documents containing term i
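A small numeric sketch of the weighting formulas above (the collection sizes and counts are toy numbers):

```python
import math

# tf-idf weighting: w_i = tf_i * idf_i, with idf_i = log(N / n_i),
# where N is the collection size and n_i the document frequency of term i.
def tfidf(tf, N, n_i):
    return tf * math.log(N / n_i)

# In a toy collection of N = 100 documents, "the" appears in all 100
# and "elephant" in only 2: the rare word gets far more weight.
w_the      = tfidf(5, 100, 100)   # 5 * log(1)  = 0.0
w_elephant = tfidf(5, 100, 2)     # 5 * log(50) ≈ 19.56
```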
Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (a normalized dot product):

  sim(A, B) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
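The formula above, applied to the two example vectors from earlier (a direct sketch of the definition):

```python
import math

# Cosine similarity: dot product of a and b divided by the
# product of their Euclidean norms.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

V1 = [2, 1, 0, 1, 1]   # "The cat chased the mouse."
V2 = [2, 1, 1, 1, 0]   # "The dog chased the cat."
sim = cosine(V1, V2)   # (4+1+0+1+0) / (sqrt(7) * sqrt(7)) = 6/7 ≈ 0.857
```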
Cosine similarity metric
• the cosine similarity metric is the cosine of the angle between the term vectors:
[figure: the angle between two term vectors w1 and w2]
Verdict: a success
• For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
• stemming required for inflected languages
• limited resolution: returns documents, not answers
• document retrieval
• opinion mining
• association mining
Opinion Mining
– Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
  • classification task
  • valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
  • make lists of positive and negative words
  • see which predominate in a given document (and mark as 'no opinion' if there are few words of either type)
  • problem: hard to make such lists
    – lists will differ for different topics
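The word-list strategy above can be sketched as follows; the positive and negative lists here are made-up illustrations, not from the lecture:

```python
# Hypothetical sentiment word lists (examples only).
POS = {"great", "excellent", "reliable", "cheap"}
NEG = {"terrible", "failed", "broken", "noisy"}

# Count which list predominates; too few opinion words => "no opinion".
def classify(text, threshold=1):
    words = text.lower().split()
    pos = sum(w in POS for w in words)
    neg = sum(w in NEG for w in words)
    if pos + neg < threshold:
        return "no opinion"
    return "positive" if pos > neg else "negative"

classify("the equipment was great and cheap")   # → "positive"
```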
Training a Model
– Instead, label the documents in a corpus and then train a classifier from the corpus
  • labeling is easier than thinking up words
  • in effect, learn positive and negative words from the corpus
– probabilistic classifier
  • identify most likely class

  s = argmax_{t ∈ {pos, neg}} P(t | W)
Using Bayes’ Rule
• The last step is based on the naïve assumption of independence of the word probabilities
  argmax_t P(t | W) = argmax_t P(W | t) P(t) / P(W)
                    = argmax_t P(W | t) P(t)
                    = argmax_t P(w_1, …, w_n | t) P(t)
                    = argmax_t Π_i P(w_i | t) · P(t)
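The final decision rule can be sketched directly; sums of logs replace the product to avoid floating-point underflow, and the probabilities below are toy numbers:

```python
import math

# Pick the class t maximizing P(t) * prod_i P(w_i | t), in log space.
def classify(words, prior, cond):
    best, best_score = None, -math.inf
    for t in prior:
        score = math.log(prior[t]) + sum(math.log(cond[t][w]) for w in words)
        if score > best_score:
            best, best_score = t, score
    return best

prior = {"pos": 0.5, "neg": 0.5}
cond = {"pos": {"great": 0.6, "failed": 0.1},
        "neg": {"great": 0.2, "failed": 0.5}}
classify(["great", "great", "failed"], prior, cond)   # → "pos"
```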
Training
• We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators
• P ( t ) = count ( docs labeled t) / N
• P ( w_i | t ) = count ( docs labeled t containing w_i ) / count ( docs labeled t )
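A sketch of these maximum likelihood estimates; each training item is a (set of words, label) pair, and counts are document-level (presence in a document), matching the definitions above:

```python
from collections import Counter

# Estimate P(t) and P(w | t) by counting over a labeled corpus.
def train(corpus):
    n_docs = len(corpus)
    doc_count = Counter(label for _, label in corpus)
    word_count = {}
    for words, label in corpus:
        for w in set(words):
            word_count.setdefault(label, Counter())[w] += 1
    prior = {t: doc_count[t] / n_docs for t in doc_count}
    cond = {t: {w: c / doc_count[t] for w, c in word_count[t].items()}
            for t in word_count}
    return prior, cond

corpus = [({"great", "fun"}, "pos"), ({"great"}, "pos"), ({"failed"}, "neg")]
prior, cond = train(corpus)   # prior["pos"] = 2/3, cond["pos"]["great"] = 1.0
```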
A Problem
• Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews
• P ( positive | GR ) = ?
P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0
The maximum likelihood estimate is poor when there is very little data
We need to ‘smooth’ the probabilities to avoid this problem
Laplace Smoothing
• A simple remedy is to add 1 to each count
  – to keep them as probabilities, we increase the denominator N by the number of outcomes (values of t) (2 for ‘positive’ and ‘negative’):

  P(t) = ( count(docs labeled t) + 1 ) / ( N + 2 )

  – for the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent)
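A sketch of the smoothed estimates; two outcomes in each case, so each denominator grows by 2:

```python
# Laplace ("add one") smoothing of the class prior and the
# conditional word probabilities.
def smoothed_prior(count_t, N):
    # P(t) = (count(docs labeled t) + 1) / (N + 2)
    return (count_t + 1) / (N + 2)

def smoothed_cond(count_w_t, count_t):
    # P(w | t) = (count(docs labeled t containing w) + 1) / (count(docs labeled t) + 2)
    return (count_w_t + 1) / (count_t + 2)

# An unseen word no longer zeroes out the whole product:
smoothed_cond(0, 10)   # 1/12, not 0
```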
An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
• Is “low” a positive or a negative term?
Ambiguous terms
“low” can be positive “low price”
or negative “low quality”
Negation
• how to handle “the equipment never failed”
Negation
• modify words following negation
“the equipment never NOT_failed”
• treat them as a separate ‘negated’ vocabulary
Negation: how far to go?
“the equipment never failed and was cheap to run”
“the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run”
• have to determine scope of negation
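One simple choice of negation scope is to stop at the next punctuation mark; a sketch (the negator list is a minimal example):

```python
import re

# Prefix NOT_ to every word after a negation word, ending the
# scope at the next punctuation mark.
NEGATORS = {"not", "never", "no"}

def mark_negation(text):
    out, negated = [], False
    for tok in re.findall(r"\w+|[^\w\s]", text.lower()):
        if not tok[0].isalnum():        # punctuation ends the scope
            negated = False
            out.append(tok)
        elif tok in NEGATORS:
            negated = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return " ".join(out)

mark_negation("the equipment never failed, and was cheap to run")
# → "the equipment never NOT_failed , and was cheap to run"
```

With a comma after “failed”, the scope ends there and “cheap” escapes negation; without one, every following word is marked, as in the slide.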
Verdict: mixed
• A simple bag-of-words strategy with a NB model works quite well for simple reviews referring to a single item, but fails
  – for ambiguous terms
  – for negation
  – for comparative reviews
  – to reveal aspects of an opinion
• the car looked great and handled well, but the wheels kept falling off
• document retrieval
• opinion mining
• association mining
Association Mining
• Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
  – e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
  – widely used in scientific and medical literature
Bag-of-words
• Simplest approach
• look for words A and B for which
  frequency (A and B in same document) >> frequency of A × frequency of B
• doesn’t work well
• want to find names (of companies, products, genes), not individual words
• interested in specific types of terms
• want to learn from a few examples
  – need contexts to avoid noise
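The co-occurrence test above can be sketched as a "lift" score over a document collection; documents and term pairs here are toy examples:

```python
# Score a word pair by how much its co-occurrence exceeds chance:
# lift = P(A and B in same doc) / (P(A) * P(B)); lift >> 1 is interesting.
def lift(docs, a, b):
    n = len(docs)
    p_a = sum(a in d for d in docs) / n
    p_b = sum(b in d for d in docs) / n
    p_ab = sum(a in d and b in d for d in docs) / n
    return p_ab / (p_a * p_b)

docs = [{"aspirin", "headache"}, {"aspirin", "headache"},
        {"aspirin"}, {"headache"}, {"stock"}, {"stock"}]
lift(docs, "aspirin", "headache")   # (2/6) / ((3/6)*(3/6)) = 4/3 > 1
```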
Needed Ingredients
• Effective text association mining needs– name recognition– term classification– preferably: ability to learn patterns– preferably: a good GUI
– demo: www.coremine.com
Conclusion
• Some tasks can be handled effectively (and very simply) by bag-of-words models,
• but most benefit from an analysis of language structure