Sentiment Analysis: best practices and challenges

Vitalii Radchenko

Problem definition

• A company wants to build a sentiment analysis model

• The main task is to classify a review as positive or negative

• Metrics: accuracy / F1 score
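
The two metrics can disagree, which is why both are listed. A minimal sketch in plain Python, assuming binary labels (1 = positive, 0 = negative) and a toy, made-up set of predictions; the function names are illustrative, not taken from the project code:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, target):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): 9 positive reviews, 1 negative,
# and a model that always answers "positive".
y_true = [1] * 9 + [0]
y_pred = [1] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("F1 (positive class):", f1_score(y_true, y_pred, target=1))
print("F1 (negative class):", f1_score(y_true, y_pred, target=0))
```

Accuracy stays high here even though the negative class is never predicted, while the per-class F1 exposes the problem.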


Data sources

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/src/parsers

• Open datasets:

• Amazon – 143.7 million reviews

• IMDb (50k), RT, Twitter (1.5 million)

• Parse data:

• iHerb, iTunes, RT, GoodReads, Expedia, Yelp, etc.

• Remember about the Terms of Use

Data Analysis

https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data

• Very important: don’t skip this step

• Calculate simple statistics:

• Count reviews

• Mean number of words per review, mean length (chars)

• Look at the distribution of the number of words (≤3, 4-10, 11-50, >50)

• Count duplicates

• Check languages (with spaCy)
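
The simple statistics listed on this slide can be sketched with the standard library alone. This is an illustration under assumptions: reviews are plain strings, words are split on whitespace, and `review_stats` is a hypothetical helper name. Language detection is left out.

```python
from collections import Counter
from statistics import mean


def review_stats(reviews):
    """Basic corpus statistics for a list of review strings."""
    word_counts = [len(review.split()) for review in reviews]

    # Bucket reviews by number of words.
    buckets = Counter()
    for n_words in word_counts:
        if n_words <= 3:
            buckets["<=3"] += 1
        elif n_words <= 10:
            buckets["4-10"] += 1
        elif n_words <= 50:
            buckets["11-50"] += 1
        else:
            buckets[">50"] += 1

    return {
        "n_reviews": len(reviews),
        "mean_words": mean(word_counts),
        "mean_chars": mean(len(review) for review in reviews),
        "word_buckets": dict(buckets),
        # Exact duplicates only; near-duplicates need fuzzier matching.
        "n_duplicates": len(reviews) - len(set(reviews)),
    }


sample_reviews = [
    "Great movie",
    "Great movie",
    "I did not like this film at all",
]
stats = review_stats(sample_reviews)
print(stats)
```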


Data preprocessing

• Text preprocessing

• Text —> Vector

• Embeddings

• Dimensionality reduction


Text preprocessing

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-NLP_Libraries.ipynb

• NLTK – over 50 corpora, WordNet, tokenization, stemming, tagging, parsing, semantic reasoning, and wrappers for industrial-strength NLP libraries

• TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.

• Pattern – fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization and pluralization, and a WordNet interface

• spaCy – tokenization, syntax-driven sentence segmentation, pre-trained word vectors, part-of-speech tagging, named entity recognition, labelled dependency parsing (Cython)

• Lemmatization examples: children -> child, better -> good



Text —> Vector

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-trees-and-boosting.ipynb

• Bag of Words

• CountVectorizer (ngrams, max/min_df, max_features)

• Tf-Idf (ngrams, max/min_df, max_features, norm, smooth_idf)

• HashingVectorizer (ngrams, n_features, non_negative; newer scikit-learn versions use alternate_sign instead)

• Sentiment features

• polarity, subjectivity (TextBlob)

• contrast conjunctions

• positive and negative smileys

• Manual features

• count exclamation and question marks

• uppercase words

• extract the rating from the text (“2/10”)
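
The manual features above are easy to compute by hand. A minimal sketch, assuming ratings are written as "N/10" or "N/5" and that a word counts as uppercase when it has at least two characters; the regular expression and the `manual_features` name are illustrative choices, not the project's code:

```python
import re

# Matches ratings such as "2/10", "7.5/10" or "4/5" (assumed formats).
RATING_RE = re.compile(r"\b(\d{1,2}(?:\.\d+)?)\s*/\s*(10|5)\b")


def manual_features(text):
    """Hand-crafted features for a single review."""
    words = text.split()
    rating = RATING_RE.search(text)
    return {
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        # Ignore single letters such as "I" or "A".
        "n_uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # Rating scaled to [0, 1]; None when the text contains no rating.
        "rating": float(rating.group(1)) / float(rating.group(2)) if rating else None,
    }


features = manual_features("WORST film ever!!! Why? 2/10")
print(features)
```

These values can be stacked next to the bag-of-words matrix as extra columns.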



Embeddings

https://github.com/3Top/word2vec-api

https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-word2vec_practice_gensim.ipynb

• Word2Vec

• pre-trained: GoogleNews (3 million words and phrases, 300 dimensions)

• gensim – the fastest implementation (also available in TensorFlow)

• GloVe

• pre-trained: Stanford

• C (Stanford) / TensorFlow / NumPy implementations

• Hellinger PCA



Dimensionality reduction

• PCA & SVD don’t work with sparse matrices

• Use TruncatedSVD instead
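
A short sketch of TruncatedSVD applied directly to a sparse Tf-Idf matrix. It assumes scikit-learn and SciPy are installed; the four reviews and `n_components=2` are toy values picked for illustration, and real data would normally use far more components.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great movie, loved it",
    "terrible plot and bad acting",
    "loved the acting, great cast",
    "bad movie, terrible ending",
]

# The vectorizer returns a SciPy sparse matrix (documents x terms).
X = TfidfVectorizer().fit_transform(reviews)

# TruncatedSVD accepts the sparse matrix as is, without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(sparse.issparse(X), X.shape, X_reduced.shape)
```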

Approaches

• Linear models:

• SVM, Logistic Regression, Naive Bayes

• Trees, ensembles and boosting

• Random Forest, ExtraTrees, XGBoost, LightGBM

• FastText

• Word-based NN

• LSTM, GRU, CNN

• Char-based NN

• CNN & Dense, CNN & LSTM

Linear Models

https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-Linear_models__svm_logistic_regression.ipynb

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-working-with-linear-models.ipynb

https://github.com/udsclub/xray-sentiment-analysis

https://github.com/udsclub/zulu-sentiment-analysis

• LinearSVC with small data, Logistic Regression with bigger data

• Count/Tf-Idf Vectorizer with many ngrams and regularization (min/max_df, max_features)

• Stemming/Lemmatization don’t help

• Decide whether to remove stopwords with cross-validation
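
One way to put these points together is a scikit-learn pipeline in which stopword removal is just another hyperparameter chosen by cross-validation. This is a sketch under assumptions: scikit-learn is installed, the eight reviews are toy data, and the grid values are arbitrary. It is not the configuration used in the linked repositories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "great movie, loved every minute",
    "wonderful acting and a great story",
    "loved the cast, wonderful film",
    "a great and touching story",
    "terrible movie, hated every minute",
    "awful acting and a boring story",
    "hated the cast, awful film",
    "a boring and terrible story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    # Unigrams and bigrams; min_df / max_features act as regularization.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__stop_words": [None, "english"],  # keep or remove stopwords
    "clf__C": [0.1, 1.0, 10.0],              # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["great film"]))
```

For small datasets, `LogisticRegression` could be swapped for `LinearSVC` in the same pipeline.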





Trees, ensembles and boosting

https://github.com/udsclub/kilo-sentiment-analysis

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-trees-and-boosting.ipynb

• The worst models for sentiment analysis :)

• Overfit

• Work well in an ensemble with linear models
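
A simple form of such an ensemble is a weighted average of the predicted probabilities from both models (soft voting). The sketch below assumes each model already outputs the probability of the positive class; the function names, the weight, and the sample numbers are illustrative.

```python
def blend_probabilities(p_linear, p_trees, linear_weight=0.5):
    """Weighted average of two models' positive-class probabilities."""
    if len(p_linear) != len(p_trees):
        raise ValueError("both models must score the same reviews")
    return [
        linear_weight * a + (1.0 - linear_weight) * b
        for a, b in zip(p_linear, p_trees)
    ]


def to_labels(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels."""
    return [int(p >= threshold) for p in probabilities]


# Made-up scores for two reviews from a linear model and a tree ensemble.
blended = blend_probabilities([0.9, 0.4], [0.7, 0.2])
print(blended, to_labels(blended))
```

The weight is best tuned on a validation set rather than fixed at 0.5.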



FastText

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-fastText.ipynb

https://github.com/udsclub/foxtrot-sentiment-analysis

• Very simple

• Needs text preprocessing (spaCy English model + stopword removal)

• Pre-trained vectors: wiki.en

• Tune regularization parameters

• Good result

Deal with it
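
fastText's supervised mode reads one example per line, with the label prefixed by `__label__` (the tool's default prefix). The helper below is a hypothetical sketch of writing reviews in that format; lowercasing and whitespace collapsing stand in for the heavier spaCy preprocessing mentioned above.

```python
def to_fasttext_line(label, text):
    """Format one review as a fastText supervised training line."""
    # Collapse newlines and repeated spaces: one example must fit on one line.
    cleaned = " ".join(text.lower().split())
    return f"__label__{label} {cleaned}"


samples = [
    ("positive", "Great\n movie!"),
    ("negative", "Boring   and too long"),
]
lines = [to_fasttext_line(label, text) for label, text in samples]
print("\n".join(lines))
```

The resulting lines can be written to a text file and passed to fastText as training input.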



Word-based NN (LSTM)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-rnn.ipynb

https://github.com/udsclub/alpha-sentiment-analysis/tree/master/full_movie_reviews

• A simple LSTM works best

• Pre-trained Google word2vec as embeddings

• Truncate from the end (“post”); a bigger maxlen is better

• Use masking, the Adam optimizer, and plenty of dropout!

• You have to store a big vocabulary and embeddings
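
Truncation and padding bring every review to the same length before it reaches the LSTM. A plain-Python sketch of "post" truncation follows; padding at the front and the padding id 0 are assumptions, not something the slide specifies, and the function name is illustrative.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Bring a sequence of token ids to exactly maxlen items."""
    # "Post" truncation: keep the beginning, drop tokens from the end.
    truncated = token_ids[:maxlen]
    # Assumed choice: pad at the front, so real tokens sit next to the output.
    padding = [pad_id] * (maxlen - len(truncated))
    return padding + truncated


print(pad_or_truncate([5, 6, 7, 8], maxlen=3))
print(pad_or_truncate([5, 6], maxlen=4))
```

With masking enabled, the network skips the positions that hold the padding id.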



Word-based NN (CNN)

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-CNN.ipynb

https://github.com/udsclub/charlie-sentiment-analysis

• 1D convolutions, max pooling, dropout, dense layers

• Better to train your own embeddings

• Use stemming and remove stopwords

• Performs worse than LSTM


Char-based NN

https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-char-models.ipynb

• Two approaches for preparing data:

• OHE (one-hot encoding, 70 symbols)

• Embeddings

• Two most popular approaches for training a model:

• N*(Conv1d) + GlobalMaxPooling + Dense

• N*(Conv1d + MaxPooling) + LSTM

• With OHE or embeddings we don’t need to store a big vocabulary
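
The one-hot approach can be sketched as follows. The alphabet here is an assumption: lowercase letters, digits, ASCII punctuation, space and newline happen to give 70 symbols, but the slide does not say which 70 symbols were used. Characters outside the alphabet and padding positions stay all-zero.

```python
import string

# Assumed 70-symbol alphabet: 26 letters + 10 digits + 32 punctuation + space + newline.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " \n"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text, maxlen):
    """Encode text as a maxlen x len(ALPHABET) matrix of 0/1 values."""
    matrix = [[0] * len(ALPHABET) for _ in range(maxlen)]
    for position, ch in enumerate(text.lower()[:maxlen]):
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:  # unknown characters stay all-zero
            matrix[position][index] = 1
    return matrix


encoded = one_hot_encode("Ok!", maxlen=5)
print(len(encoded), len(encoded[0]))
```

Only the small character table needs to be stored, instead of a word vocabulary.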



My ranking (small data)

1. Word-based LSTM

2. Linear models

3. Char-based CNN + LSTM

4. FastText

5. Word-based CNN

6. Boosting


My ranking (big data)

1. Word-based LSTM

2. Char-based CNN + LSTM

3. FastText

4. Linear models (log reg)

5. Word-based CNN

6. Boosting


Observations 1

• Small number of short reviews

• Linear models with BoW (many ngrams and strong regularization)

• Small number of long reviews

• One-layer LSTM with pre-trained Google word2vec and plenty of dropout

• Many reviews

• LSTM and char-CNN

Observations 2

https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/test-attention.ipynb

• Begin with simple models: a one-layer LSTM or Logistic Regression

• Complex LSTMs (Bidirectional, Stacked, Merged, with Attention) usually don’t work better than a simple LSTM

• LSTM with Attention gives the biggest weights to the last words



Observations 3

• An imbalanced dataset leads to heavy overfitting that shows up on the smaller class on the test set

• Pay attention to the F1 score and the classification report

• If you have many reviews, just remove some samples from the bigger class

https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb
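
Removing samples from the bigger class (random undersampling) can be sketched in a few lines. The function name, the fixed seed, and the toy data are assumptions for illustration; this is not the code from the linked notebook.

```python
import random
from collections import Counter, defaultdict


def undersample(texts, labels, seed=0):
    """Randomly drop samples so that every class has as many as the smallest one."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    n_min = min(len(items) for items in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, n_min):
            balanced.append((text, label))
    rng.shuffle(balanced)  # avoid blocks of a single class

    return [t for t, _ in balanced], [l for _, l in balanced]


texts = ["good", "great", "nice", "fine", "superb", "bad", "awful"]
labels = [1, 1, 1, 1, 1, 0, 0]
balanced_texts, balanced_labels = undersample(texts, labels)
print(Counter(balanced_labels))
```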

Observations 4

• Predicting other domains

• Use the Amazon dataset (works well)

• Trained on Amazon Movies and TV, 1.5 million reviews (LSTM; other models lose more than 1%)

• “Digital Music” – 95.82%, “Office Products” – 95.76%, “Video Games” – 94.08%


Challenges

https://github.com/udsclub/alpha-sentiment-analysis/tree/master/Enrichment%20dataset

• Enrich the dataset with synonyms

• synonymscrawler, word2vec (the closest vector by cosine distance), WordNet (works badly)

• Transfer learning to other languages

• Train on English and transfer to other languages with the same characters (works well)
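
The "closest vector by cosine distance" idea can be illustrated with toy vectors. The three-dimensional vectors below are made up; with real word2vec embeddings the lookup works the same way, only over a much larger table. Function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_word(word, vectors):
    """Return the other word whose vector is most similar to the given one."""
    query = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


# Made-up embeddings for illustration only.
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.7, 0.3, 0.0],
    "movie": [0.0, 0.1, 0.9],
}

print(closest_word("good", TOY_VECTORS))
```

With real embeddings the nearest neighbour is not always a synonym (antonyms often sit close together), so the replacements are worth checking before they are used to enrich a sentiment dataset.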


Contact me

• OpenDataScience – @vradchenko

• Facebook – https://www.facebook.com/vitaliyradchenko127

• Email – [email protected]

• UDS Club – https://github.com/udsclub

Thank you
