Peiti Li¹, Shan Wu², Xiaoli Chen¹
¹Computer Science Dept. ²Statistics Dept.
Columbia University, 116th Street and Broadway, New York, NY 10027, USA
introducing
Movie Review
Why?
It is a faster and more direct way for people to share their opinions on a topic.
Python
Twitter Search API + Stream API
Opinion Mining or Sentiment Analysis
Computational study of opinions, sentiments, subjectivity, attitudes
Just like a text classification task but different from topic-based text classification
In topic-based text classification (e.g., computer, sport, science), topic words are important.
But in sentiment classification, opinion/sentiment words are more important, e.g., awesome, great, excellent, horrible, bad, worst, etc.
Structure the unstructured: natural language text is often regarded as unstructured data.
Besides data mining, we need NLP technologies.
Why a HARD task?
I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive,…
Credits: Bing Liu for this example
Use tweets to tell people whether to buy a movie ticket
Classify the tweet as either positive or negative
Give a rating of the movie based on tweets
Accuracies of Different Machine Learning Approaches
Table from: Bo Pang et al. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proc. of the ACL, pp. 79-86. Association for Computational Linguistics
Our approach is Naïve Bayes
P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)
Smoothing:
P(token | sentiment) = (count(this token in class) + 1) / (count(all tokens in class) + count(all tokens))
We didn’t use any third-party classifier; we coded our classifier all by ourselves.
Reason: we want to explore what is under the hood, and to tune the algorithm structure according to the experiment results.
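A minimal sketch of the two formulas above, not the classifier used for the results in these slides. It assumes a sentence is a list of tokens, that P(sentence | sentiment) factors into a product of P(token | sentiment) (the naive independence assumption), and that "count(all tokens)" in the smoothing term means the number of distinct tokens seen in training, which is the usual add-one form. The class and method names are made up for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Two-class (or multi-class) Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()   # label -> number of training sentences
        self.vocab = set()            # distinct tokens seen in training

    def train(self, tokens, label):
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_posterior(self, tokens, label):
        # log P(sentiment) + sum of log P(token | sentiment).
        # P(sentence) is the same for every class, so it is dropped.
        counts = self.token_counts[label]
        class_total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            # Add-one smoothing: unseen tokens get a small nonzero probability.
            score += math.log((counts[tok] + 1) / (class_total + len(self.vocab)))
        return score

    def classify(self, tokens):
        return max(self.token_counts, key=lambda lab: self.log_posterior(tokens, lab))


nb = NaiveBayes()
nb.train("a great movie".split(), "pos")
nb.train("great fun".split(), "pos")
nb.train("a bad movie".split(), "neg")
nb.train("boring and bad".split(), "neg")
print(nb.classify(["great"]))         # pos
print(nb.classify(["bad", "movie"]))  # neg
```

Working in log space avoids underflow when many small probabilities are multiplied.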
Getting Started
Dataset
» Dev set: the movie review dataset provided by Bo Pang and Lillian Lee, Cornell University
  sentence_polarity_dataset_v1.0
  5331 positive, 5331 negative
» Real set: tweets about a specific movie
  Cannot tell exact number
  Twitter Search API (REST): last 6-7 days
  Twitter Stream API: real-time timeline
  (Drawbacks: the REST API has rate limiting; stream data takes time to collect.)
Top 100 words including stopwords
Better and better but….
Baseline model: Naïve Bayes without any nontrivial text preprocessing; punctuation excluded, stopwords included.
Tuned model: still Naïve Bayes, but with a better feature extraction technique: eliminating low-information features. Best unigram model; best unigram and bigram model.
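A rough sketch of what "eliminating low-information features" can look like. The slides do not say which scoring measure was used, so the score here (absolute log-ratio of add-one smoothed class probabilities) is only one possible choice, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def informative_words(pos_docs, neg_docs, top_n):
    """Return the top_n words that occur most unevenly across the two classes."""
    pos_counts = Counter(w for doc in pos_docs for w in doc)
    neg_counts = Counter(w for doc in neg_docs for w in doc)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())

    def score(word):
        # Add-one smoothed class-conditional probabilities.
        p_pos = (pos_counts[word] + 1) / (pos_total + len(vocab))
        p_neg = (neg_counts[word] + 1) / (neg_total + len(vocab))
        # Near 0 when the word is equally likely in both classes.
        return abs(math.log(p_pos / p_neg))

    ranked = sorted(vocab, key=lambda w: (-score(w), w))
    return set(ranked[:top_n])


pos_docs = [["great", "movie"], ["great", "fun", "movie"]]
neg_docs = [["bad", "movie"], ["bad", "boring", "movie"]]
keep = informative_words(pos_docs, neg_docs, top_n=2)
print(sorted(keep))                                            # ['bad', 'great']
print([w for w in ["a", "great", "movie"] if w in keep])      # ['great']
```

A word like "movie", which appears equally often in both classes, scores zero and is dropped before training.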
Dev set result:

| Trainset 5000, Testset 331 | Recall | Specificity | Accuracy |
|---|---|---|---|
| Baseline | 76.13% | 82.78% | 79.46% |
| Baseline, stopwords removed | 75.83% | 79.46% | 77.64% |
| Best unigram, stopwords not removed | 83.99% | 85.20% | 84.60% |
| Best unigram, stopwords removed | 82.78% | 85.80% | 84.29% |
| Best unigram and bigram, stopwords not removed | N/A | N/A | 78.24% |

Took 1 hour! The Intel Core i5 laptop died in the middle from running too hot for too long.
Observation: definitely do not consider bigrams, but we still don’t know whether we should remove the stopwords.
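For reference, a small sketch of how recall, specificity, and accuracy relate to the raw counts. The counts below are not from the slides: they are back-computed from the Baseline percentages, assuming 331 positive and 331 negative test sentences and treating "positive" as the positive class.

```python
def binary_metrics(tp, fn, tn, fp):
    """Recall, specificity, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)            # share of positives classified positive
    specificity = tn / (tn + fp)       # share of negatives classified negative
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, specificity, accuracy


# Assumed counts: 252 of 331 positives and 274 of 331 negatives correct.
recall, specificity, accuracy = binary_metrics(tp=252, fn=79, tn=274, fp=57)
print(f"{recall:.2%} {specificity:.2%} {accuracy:.2%}")  # 76.13% 82.78% 79.46%
```

With equal class sizes, accuracy is simply the average of recall and specificity.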
150 tweets (75 labeled by Xiaoli, 75 labeled by Shan): 5 neg, 87 pos
150 tweets (75 labeled by Xiaoli, 75 labeled by Shan): 76 neg, 32 pos
Results on the 2 recent movies (Real set)
Regular expression 1: (?:@\S*|#\S*|http(?=.*://)\S*)
Regular expression 2: (#[A-Za-z0-9]+) | (@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)
(All punctuation removed)

| | Hugo | Muppets | Together |
|---|---|---|---|
| Stopwords removed | 64.13% | 64.81% | 64.50% |
| Stopwords included | 63.04% | 54.63% | 58.50% |
| Stopwords removed | 70.65% | 62.96% | 66.50% |
| Stopwords included | 65.22% | 53.70% | 59.00% |

Which regular expression should we choose based on this result? Hard to say…. :-(
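To see how the two regular expressions differ, here is a small sketch that applies each to a made-up tweet. Regular expression 1 is copied as written; for regular expression 2 the spaces around the first "|" are dropped, on the assumption that they are slide typography rather than part of the pattern. The helper name `clean` and the sample tweet are hypothetical.

```python
import re

# Regular expression 1: strips @mentions, #hashtags, and http URLs.
REGEX_1 = re.compile(r"(?:@\S*|#\S*|http(?=.*://)\S*)")

# Regular expression 2 (spaces around "|" removed, see assumption above):
# strips hashtags, mentions, URLs, and every non-alphanumeric character.
REGEX_2 = re.compile(r"(#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean(tweet, pattern):
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(pattern.sub(" ", tweet).split())


tweet = "@bob Loved #Hugo :) http://t.co/abc so good!"  # made-up example
print(clean(tweet, REGEX_1))  # Loved :) so good!
print(clean(tweet, REGEX_2))  # Loved so good
```

The first pattern leaves punctuation, and therefore emoticons, in place; the second removes them along with everything else that is not a letter or digit.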
We moved our attention to other similar products:
LingPipe, Twendz, Twitter Sentiment, tweetfeel
twittersentiment.appspot.com
They are new too.
www.tweetfeel.com
Our classifier gets exactly the same results as theirs, but wait…
Two tweets made us frown :-(
Emoticons play a role!!!
:-) >:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) >:D :-D :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 :P FTW
:'( ;*( :_( T.T T_T Y.Y Y_Y >:[ :-( :( :-c :c :-< :< :-[ :[ :{ >.> <.< >.< >:\ >:/ :-/ :-. :/ :\ =/ =\ :S
So we chose the regular expression that keeps emoticons.
And we built a dictionary to eliminate all punctuation marks that appear alone:
'`','~','!','@','#','$','%','^','&','*','(',')','-','_','+','=','{','}','[',']',';',':','"',"'",'<','>',',','.','?','|','\\','/'
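A minimal sketch of the idea, assuming tweets are split on whitespace: a token is dropped only when it is exactly one of the punctuation marks listed above, so multi-character emoticons survive. The names below are hypothetical.

```python
# The 32 punctuation marks from the list above.
STANDALONE_PUNCTUATION = set("`~!@#$%^&*()-_+={}[];:\"'<>,.?|\\/")


def drop_standalone_punctuation(tokens):
    """Remove tokens that are a single punctuation mark; keep everything else."""
    return [tok for tok in tokens if tok not in STANDALONE_PUNCTUATION]


tokens = "Great movie ! :) - loved it . :-(".split()
print(drop_standalone_punctuation(tokens))
# ['Great', 'movie', ':)', 'loved', 'it', ':-(']
```

Punctuation attached to a word (for example "good!") is left untouched by this step.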
Finally, the python begins to catch the twittering bird……..
Demo
“Happy” Feet? So all tweets are positive?
We still need to do more semi-supervised learning.
1. Specific bigrams like “don’t love”
2. A finer classifier that can exclude objective tweets
3. Detect and remove annoying movie names like “Happy Feet”
4. Give more weight to dominant words like “excellent”, “worst”
5. Our final task: give ratings
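As an illustration of item 3, one simple possibility (an assumption, not something already built) is to strip the movie title from each tweet before classification, so that a word like "Happy" in the title does not count as sentiment. The function name is hypothetical.

```python
import re


def remove_title(tweet, title):
    """Remove case-insensitive occurrences of the movie title from a tweet."""
    pattern = re.compile(re.escape(title), flags=re.IGNORECASE)
    # Replace the title with a space, then collapse runs of whitespace.
    return " ".join(pattern.sub(" ", tweet).split())


print(remove_title("Happy Feet was so boring", "Happy Feet"))
# was so boring
print(remove_title("so happy after watching HAPPY FEET", "Happy Feet"))
# so happy after watching
```

This only catches the title written out with its spaces; hashtag forms such as "#HappyFeet" would need separate handling.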
Thank you all! Thank you STAT 4240! Thank you Columbia!