Sentiment Analysis: best practices and challenges
Vitalii Radchenko
Problem definition
• A company wants to build a sentiment analysis model
• The main task is to classify a review as positive or negative
• Metrics: accuracy / F1 score
Data sources
• Open datasets:
  • Amazon – 143.7 million reviews
  • IMDb (50k), RT, Twitter (1.5M)
• Parsed (scraped) data:
  • iHerb, iTunes, RT, GoodReads, Expedia, Yelp, etc.
  • Remember to check each site's Terms of Use
https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/src/parsers
Data Analysis
• Very important: don't skip this step
• Calculate simple statistics:
  • Number of reviews
  • Mean number of words per review, mean length (chars)
  • Distribution of the number of words (<3, 4-10, 11-50, >51)
  • Number of duplicates
  • Check languages (with spaCy)
https://github.com/udsclub/Project1.-Sentiment-Analysis/tree/master/data
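The simple statistics listed above can be sketched in plain Python. This is a minimal illustration, not the code from the linked repository: the function name `review_stats`, the sample reviews, and the duplicate rule (case- and whitespace-insensitive match) are assumptions. The bucket edges follow the slide's ranges, with 3 and 51 assigned to the outer buckets as a further assumption. Language checking is left out.

```python
from collections import Counter
from statistics import mean


def review_stats(reviews):
    """Compute simple corpus statistics for a list of review strings."""
    word_counts = [len(r.split()) for r in reviews]

    # Word-count buckets (edges are an assumption, see the text above).
    buckets = Counter()
    for n in word_counts:
        if n <= 3:
            buckets["<=3"] += 1
        elif n <= 10:
            buckets["4-10"] += 1
        elif n <= 50:
            buckets["11-50"] += 1
        else:
            buckets[">=51"] += 1

    # A duplicate is any repeat of an already-seen text after
    # lowercasing and collapsing whitespace.
    normalized = Counter(" ".join(r.lower().split()) for r in reviews)
    duplicates = sum(count - 1 for count in normalized.values())

    return {
        "n_reviews": len(reviews),
        "mean_words": mean(word_counts),
        "mean_chars": mean(len(r) for r in reviews),
        "word_buckets": dict(buckets),
        "n_duplicates": duplicates,
    }


sample_reviews = [
    "Great movie",
    "great  movie",
    "I did not like the plot at all, but the acting was fine",
    "Terrible",
]
stats = review_stats(sample_reviews)
print(stats)
```

A very large share of reviews in the shortest bucket, or a high duplicate count, is usually worth investigating before any model is trained.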
Data preprocessing
• Text preprocessing (e.g. lemmatization: children -> child, better -> good)
• Text -> Vector
• Embeddings
• Dimensionality reduction
Text preprocessing
• NLTK – over 50 corpora, WordNet, tokenization, stemming, tagging, parsing, semantic reasoning, and wrappers for industrial-strength NLP libraries
• TextBlob – part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.
• Pattern – fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface
• spaCy – tokenization, syntax-driven sentence segmentation, pre-trained word vectors, part-of-speech tagging, named entity recognition, labelled dependency parsing (Cython)
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-NLP_Libraries.ipynb
Text -> Vector
• Bag of Words
  • CountVectorizer (ngrams, max/min_df, max_features)
  • TF-IDF (ngrams, max/min_df, max_features, norm, smooth_idf)
  • HashingVectorizer (ngrams, n_features, non_negative)
• Sentiment features
  • Polarity, subjectivity (TextBlob)
  • Contrast conjunctions
  • Positive and negative smileys
• Manual features
  • Counts of exclamation and question marks
  • Uppercase words
  • Rating extracted from the text ("2/10")
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb
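A possible sketch of the manual and smiley/conjunction features, using only the standard library. The word lists, the smiley sets, the rating pattern (it only recognizes "x/10" and "x/5"), and the function name `manual_features` are all assumptions made for illustration; the linked notebook may do this differently.

```python
import re

# Illustrative lists; these are assumptions, not taken from the talk.
CONTRAST_CONJUNCTIONS = {"but", "however", "although", "though", "yet"}
POSITIVE_SMILEYS = (":)", ":-)", ":D", ";)")
NEGATIVE_SMILEYS = (":(", ":-(", ":'(")
RATING_RE = re.compile(r"\b(\d{1,2}(?:\.\d)?)\s*/\s*(10|5)\b")


def manual_features(text):
    """Extract a few hand-crafted features from one review."""
    tokens = re.findall(r"[A-Za-z']+", text)
    rating = RATING_RE.search(text)
    return {
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        # Fully uppercase words of 2+ letters, so "I" and "A" are not counted.
        "n_uppercase_words": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        "n_contrast": sum(1 for t in tokens if t.lower() in CONTRAST_CONJUNCTIONS),
        "n_pos_smileys": sum(text.count(s) for s in POSITIVE_SMILEYS),
        "n_neg_smileys": sum(text.count(s) for s in NEGATIVE_SMILEYS),
        # Rating scaled to [0, 1], or None when no rating pattern is found.
        "rating": float(rating.group(1)) / float(rating.group(2)) if rating else None,
    }


features = manual_features("WORST film ever!!! Nice cast, but 2/10 :(")
print(features)
```

Features like these are typically stacked next to the bag-of-words matrix rather than used on their own.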
Embeddings
• Word2Vec
  • Pre-trained: GoogleNews 6Bx300
  • gensim is the fastest implementation (also available in TensorFlow)
• GloVe
  • Pre-trained: Stanford
  • Implementations: C (Stanford), TensorFlow, NumPy
• Hellinger PCA
https://github.com/3Top/word2vec-api
https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-word2vec_practice_gensim.ipynb
Dimensionality reduction
• Plain PCA and SVD don't work with sparse matrices
• Use TruncatedSVD instead
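As a rough illustration, scikit-learn's `TruncatedSVD` can be applied directly to the sparse output of a vectorizer. The toy documents and `n_components=2` are assumptions chosen only to keep the example small; real bag-of-words matrices would use a much larger number of components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great movie with a great cast",
    "terrible plot and terrible acting",
    "great acting but a boring plot",
    "boring movie and a terrible cast",
]

# TfidfVectorizer returns a SciPy sparse matrix (documents x vocabulary).
X_sparse = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD does not center the data, so it accepts the sparse matrix as is.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_reduced.shape)
```

The result is a dense array with one row per document, which can then be fed to models that do not handle sparse input well.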
Approaches
• Linear models
  • SVM, Logistic Regression, Naive Bayes
• Trees, ensembles and boosting
  • Random Forest, ExtraTrees, XGBoost, LightGBM
• FastText
• Word-based NN
  • LSTM, GRU, CNN
• Char-based NN
  • CNN & Dense, CNN & LSTM
Linear Models
• LinearSVC for small data, Logistic Regression for bigger data
• Count/TF-IDF vectorizer with many n-grams and regularization (min/max_df, max_features)
• Stemming and lemmatization don't help
• Validate stopword removal with cross-validation
https://github.com/udsclub/workshop/blob/master/notebooks/USDC-workshop-Linear_models__svm_logistic_regression.ipynb
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-working-with-linear-models.ipynb
https://github.com/udsclub/xray-sentiment-analysis
https://github.com/udsclub/zulu-sentiment-analysis
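A minimal scikit-learn sketch of the linear-model recipe above: TF-IDF over word n-grams with vocabulary pruning, followed by a regularized Logistic Regression. The toy data and every hyperparameter value here are assumptions for illustration, not the settings used in the linked notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; 1 = positive, 0 = negative.
texts = [
    "great movie, loved it", "really great acting", "loved the cast",
    "awful movie, hated it", "really awful acting", "hated the cast",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    # Word uni- to trigrams; min_df/max_df/max_features prune the vocabulary.
    TfidfVectorizer(ngram_range=(1, 3), min_df=1, max_df=0.9, max_features=50000),
    # Smaller C means stronger L2 regularization.
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["great cast", "awful cast"])
print(predictions)
```

For small datasets, `LinearSVC` from `sklearn.svm` could be swapped in for the last pipeline step. Choices such as stopword removal can be compared by running the same pipeline through cross-validation.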
Trees, ensembles and boosting
• The worst models for sentiment analysis :)
• Prone to overfitting
• Work well in an ensemble with linear models
https://github.com/udsclub/kilo-sentiment-analysis
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-trees-and-boosting.ipynb
FastText
• Very simple
• Needs text preprocessing (spaCy English model + stopword removal)
• Pre-trained vectors: wiki.en
• Tune the regularization parameters
• Good results
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-fastText.ipynb
https://github.com/udsclub/foxtrot-sentiment-analysis
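fastText's supervised mode reads one example per line, with the label marked by the default `__label__` prefix. The sketch below prepares such lines. Note that it swaps the spaCy + stopword preprocessing mentioned above for a simple regex normalization, purely to stay dependency-free; the helper name `to_fasttext_line` and the sample reviews are assumptions.

```python
import re


def to_fasttext_line(text, label):
    """Format one review as a line of fastText supervised training data."""
    # Lowercase and put spaces around punctuation so it becomes separate tokens.
    cleaned = re.sub(r"([.!?,'/()])", r" \1 ", text.lower())
    cleaned = " ".join(cleaned.split())
    return f"__label__{label} {cleaned}"


reviews = [
    ("Great movie, loved it!", "positive"),
    ("Awful. Never again", "negative"),
]
lines = [to_fasttext_line(text, label) for text, label in reviews]
print("\n".join(lines))
```

Written to a file, lines like these could be passed to `fasttext supervised -input train.txt -output model`, optionally with `-pretrainedVectors` pointing at the wiki.en vectors.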
Word-based NN (LSTM)
• A simple LSTM works best
• Pre-trained Google word2vec as embeddings
• Truncate at the end of the sequence ("post"); a bigger maxlen is better
• Use masking, the Adam optimizer, and plenty of dropout!
• Requires storing a big vocabulary and embeddings
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-rnn.ipynb
https://github.com/udsclub/alpha-sentiment-analysis/tree/master/full_movie_reviews
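To make the "truncate post" and masking bullets concrete, here is a framework-free sketch of how reviews are turned into fixed-length id sequences. The toy vocabulary, the helper names, and padding at the end are assumptions. In Keras the same effect is available through `pad_sequences(..., truncating="post")`, with id 0 masked via `Embedding(mask_zero=True)`.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return exactly `maxlen` ids: cut the tail of long inputs, pad short ones."""
    truncated = token_ids[:maxlen]  # "post" truncation keeps the beginning
    return truncated + [pad_id] * (maxlen - len(truncated))


# Hypothetical toy vocabulary; id 0 is reserved for padding so it can be masked.
vocab = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}


def encode(text):
    """Map known words to ids and silently drop unknown ones."""
    return [vocab[w] for w in text.split() if w in vocab]


texts = ["the movie was great", "boring", "the movie was boring the movie"]
batch = [pad_or_truncate(encode(t), maxlen=4) for t in texts]
print(batch)
```

A larger `maxlen` discards less of each long review, at the cost of more computation per batch.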
Word-based NN (CNN)
• 1D convolutions, max pooling, dropout, dense layers
• Better to train your own embeddings
• Apply stemming and remove stopwords
• Performs worse than LSTM
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-CNN.ipynb
https://github.com/udsclub/charlie-sentiment-analysis
Char-based NN
• Two approaches for preparing data:
  • One-hot encoding (OHE, 70 symbols)
  • Embeddings
• Two most popular approaches for training a model:
  • N*(Conv1d) + GlobalMaxPooling + Dense
  • N*(Conv1d + MaxPooling) + LSTM
• With OHE or embeddings, we don't need to store a big vocabulary
https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-char-models.ipynb
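One-hot encoding at the character level might look like the sketch below. The alphabet here is a shortened, assumed one (42 symbols); the talk only states that 70 symbols were used and does not list them. The function name and `maxlen` value are also illustrative.

```python
import string

# Assumed alphabet for illustration; the talk only says 70 symbols were used.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?'"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_chars(text, maxlen):
    """Encode text as a maxlen x len(ALPHABET) matrix of 0/1 values."""
    matrix = [[0] * len(ALPHABET) for _ in range(maxlen)]
    for pos, ch in enumerate(text.lower()[:maxlen]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:  # unknown characters stay all-zero rows
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_chars("Bad!", maxlen=6)
print(len(encoded), len(encoded[0]))
```

The only lookup table the model needs is the alphabet itself, which is why no large vocabulary has to be stored.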
My ranking (small data)
1. Word-based LSTM
2. Linear models
3. Char-based CNN + LSTM
4. FastText
5. Word-based CNN
6. Boosting
My ranking (big data)
1. Word-based LSTM
2. Char-based CNN + LSTM
3. FastText
4. Linear models (log reg)
5. Word-based CNN
6. Boosting
Observations 1
• Small number of short reviews
  • Linear models with BoW (many n-grams and strong regularization)
• Small number of long reviews
  • One-layer LSTM with pre-trained Google word2vec and plenty of dropout
• Many reviews
  • LSTM and char-CNN
Observations 2
• Begin with simple models: a one-layer LSTM or Logistic Regression
• Complex LSTMs (bidirectional, stacked, merged, with attention) usually don't work better than a simple LSTM
• An LSTM with attention gives the biggest weights to the last words
https://github.com/udsclub/whiskey-sentiment-analysis/blob/master/test-attention.ipynb
Observations 3
• An imbalanced dataset leads to heavy overfitting on the smaller class, which shows up on the test set
• Pay attention to the F1 score and the classification report
• If you have many reviews, just remove some samples from the bigger class
https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb
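Removing samples from the bigger class is random undersampling. A minimal standard-library sketch follows; the function name, the fixed seed, and the toy data are assumptions, and balancing every class down to the size of the smallest one is just one possible target ratio.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until all match the smallest."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label, items in by_label.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid blocks of identical labels
    return balanced


reviews = [f"review {i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2
balanced = undersample(reviews, labels)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

It is usually safer to undersample only the training split, so that validation and test metrics still reflect the real class distribution.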
Observations 4
• Predicting on other domains
  • Use the Amazon dataset (works well)
  • Trained on 1.5M Amazon Movies and TV reviews (LSTM; other models lose more than 1%)
  • "Digital Music" – 95.82%, "Office Products" – 95.76%, "Video Games" – 94.08%
https://github.com/udsclub/alpha-sentiment-analysis/blob/master/amazonTv/scripts/validation_curves.ipynb
Challenges
• Enrich the dataset with synonyms
  • synonymscrawler, word2vec (the closest vector by cosine distance), WordNet (works poorly)
• Transfer learning to other languages
  • Train on English and transfer to other languages with the same alphabet (works well)
https://github.com/udsclub/alpha-sentiment-analysis/tree/master/Enrichment%20dataset
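The word2vec variant of synonym enrichment amounts to replacing a word with the word whose vector is closest by cosine similarity. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names (`closest_word`, `augment`) purely to show the mechanics; with real embeddings, a library call such as gensim's `KeyedVectors.most_similar` would do the lookup.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_word(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


# Made-up 3-d vectors for illustration; real word2vec vectors are much larger.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "table": [0.0, 0.1, 0.9],
}


def augment(tokens, vectors):
    """Create a new sample by swapping every known word for its nearest neighbour."""
    return [closest_word(t, vectors) if t in vectors else t for t in tokens]


augmented = augment("a good film".split(), toy_vectors)
print(augmented)
```

With real embeddings the nearest neighbour is not always a synonym, so generated samples are worth spot-checking before they are added to the training set.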
Contact me
• OpenDataScience – @vradchenko
• Facebook – https://www.facebook.com/vitaliyradchenko127
• Email – [email protected]
• UDS Club – https://github.com/udsclub
Thank you