Statistical Machine Learning for Text Classification with scikit-learn and NLTK


  • Statistical Learning for Text Classification with scikit-learn and NLTK
    Olivier Grisel (http://twitter.com/ogrisel)
    PyCon 2011
  • Outline
    Why text classification?
    What is text classification?
    How?
      scikit-learn
      NLTK
      Google Prediction API
    Some results
  • Applications of Text Classification (Task: Predicted outcome)
    Spam filtering: Spam, Ham, Priority
    Language guessing: English, Spanish, French, ...
    Sentiment Analysis for Product Reviews: Positive, Neutral, Negative
    News Feed Topic Categorization: Politics, Business, Technology, Sports, ...
    Pay-per-click optimal ads placement: Will yield clicks / Won't
    Recommender systems: Will I buy this book? / I won't
  • Supervised Learning Overview
    Convert training data to a set of vectors of features
    Build a model based on the statistical properties of the features in the training set, e.g.:
      Naïve Bayesian Classifier
      Logistic Regression
      Support Vector Machines
    For each new text document to classify:
      Extract its features
      Ask the model to predict the most likely outcome
  • Summary (diagram): training documents (text, images, sounds, ...) are turned into feature vectors and, together with their labels, fed to a machine learning algorithm; the resulting predictive model maps the feature vector of a new document to an expected label.
  • Bags of Words
    Tokenize document: list of uni-grams
      ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
    Binary occurrences / counts: {'the': True, 'quick': True, ...}
    Frequencies: {'the': 0.22, 'quick': 0.11, 'brown': 0.11, 'fox': 0.11}
    TF-IDF: {'the': 0.001, 'quick': 0.05, 'brown': 0.06, 'fox': 0.24}
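    The first two representations above can be computed directly from the token list; a minimal pure-Python sketch (the helper names are illustrative, not from the slides):
      from collections import Counter

      def binary_occurrences(tokens):
          # {'the': True, 'quick': True, ...}
          return {t: True for t in tokens}

      def term_frequencies(tokens):
          # raw counts normalized by document length, e.g. {'the': 0.22, ...}
          counts = Counter(tokens)
          total = float(len(tokens))
          return {t: c / total for t, c in counts.items()}

      tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
      print(term_frequencies(tokens)['the'])   # 2 / 9, i.e. ~ 0.22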
  • Better than frequencies: TF-IDF
    Term Frequency × Inverse Document Frequency
    Non-informative words such as 'the' are scaled down
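    For reference, a minimal sketch of one common TF-IDF weighting (raw term frequency times the log of the inverse document frequency; real implementations such as scikit-learn's add smoothing and normalization, so exact values differ):
      import math
      from collections import Counter

      def tfidf(doc_tokens, corpus):
          # corpus is a list of token lists; doc_tokens is one document from it
          counts = Counter(doc_tokens)
          n_docs = len(corpus)
          weights = {}
          for term, count in counts.items():
              tf = count / float(len(doc_tokens))
              df = sum(1 for doc in corpus if term in doc)   # document frequency
              idf = math.log(float(n_docs) / df)
              weights[term] = tf * idf
          return weights
    A word like 'the' occurs in nearly every document, so its idf (and hence its weight) approaches zero.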
  • Even better features
    bi-grams of words: 'New York', 'very bad', 'not good'
    n-grams of chars: 'the', 'ed ', ' a ' (useful for language guessing)
    Combine with: binary occurrences, frequencies, or TF-IDF
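    A minimal sketch of how such n-grams can be enumerated in plain Python (helper names are illustrative):
      def word_ngrams(tokens, n):
          # e.g. word_ngrams(['not', 'good', 'at', 'all'], 2) -> ['not good', 'good at', 'at all']
          return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

      def char_ngrams(text, n):
          # e.g. char_ngrams('the fox', 3) -> ['the', 'he ', 'e f', ' fo', 'fox']
          return [text[i:i + n] for i in range(len(text) - n + 1)]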
  • scikit-learn
  • scikit-learn
    BSD licensed
    numpy / scipy / cython / C++ wrappers
    Many state-of-the-art implementations
    A new release every 3 months
    17 contributors on release 0.7
    Not just for text classification
  • Features Extraction in scikit-learn
    from scikits.learn.features.text import WordNGramAnalyzer
    text = (u"J'ai mangé du kangourou ce midi,"
            u" c'était pas très bon.")
    WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
    [u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi', u'etait',
     u'pas', u'tres', u'bon', u'ai mange', u'mange du', u'du kangourou',
     u'kangourou ce', u'ce midi', u'midi etait', u'etait pas',
     u'pas tres', u'tres bon']
  • Features Extraction in scikit-learn
    from scikits.learn.features.text import CharNGramAnalyzer
    analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
    char_ngrams = analyzer.analyze(text)
    print char_ngrams[:5] + char_ngrams[-5:]
    [u"j'a", u"'ai", u'ai ', u'i m', u' ma',
     u's tres', u' tres ', u'tres b', u'res bo', u'es bon']
  • TF-IDF features & SVMs
    import pickle
    from scikits.learn.features.text.sparse import Vectorizer
    from scikits.learn.svm.sparse import LinearSVC

    vec = Vectorizer(analyzer=analyzer)
    features = vec.fit_transform(list_of_documents)
    clf = LinearSVC(C=100).fit(features, labels)
    clf2 = pickle.loads(pickle.dumps(clf))
    predicted_labels = clf2.predict(features_of_new_docs)
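    The module paths above are from the scikits.learn 0.7 packaging used at the time of the talk; a roughly equivalent sketch with the current scikit-learn package layout (using TfidfVectorizer in place of the old sparse Vectorizer is an assumption, not part of the slides):
      import pickle
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      vec = TfidfVectorizer()                             # TF-IDF bag-of-words features
      features = vec.fit_transform(list_of_documents)     # sparse document/term matrix
      clf = LinearSVC(C=100).fit(features, labels)        # train the linear SVM
      clf2 = pickle.loads(pickle.dumps(clf))              # fitted models are picklable
      predicted_labels = clf2.predict(vec.transform(new_documents))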
  • Summary (diagram, scikit-learn API): training documents are vectorized with vec.fit_transform(docs) to build the feature matrix X; the classifier is trained with clf.fit(X, y); new documents are vectorized with vec.transform(docs_new) and their labels predicted with clf.predict(X_new).
  • NLTK
    Code: ASL 2.0 & Book: CC-BY-NC-ND
    Tokenizers, Stemmers, Parsers, Classifiers, Clusterers, Corpus Readers
  • NLTK Corpus Downloader
    >>> import nltk
    >>> nltk.download()
  • Using a NLTK corpus
    >>> from nltk.corpus import movie_reviews as reviews
    >>> pos_ids = reviews.fileids('pos')
    >>> neg_ids = reviews.fileids('neg')
    >>> len(pos_ids), len(neg_ids)
    (1000, 1000)
    >>> reviews.words(pos_ids[0])
    ['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
  • Common data cleanup operations
    Lower case & remove accentuated chars:
      import unicodedata
      s = ''.join(c for c in unicodedata.normalize('NFD', s.lower())
                  if unicodedata.category(c) != 'Mn')
    Extract only word tokens of at least 2 chars
      Using NLTK tokenizers & stemmers
      Using a simple regexp: re.compile(r"\b\w\w+\b", re.U).findall(s)
  • Feature Extraction with NLTK: Unigram features
    def word_features(words):
        return dict((word, True) for word in words)
  • Feature Extraction with NLTK: Bigram Collocations
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures as BAM
    from itertools import chain

    def bigram_features(words, score_fn=BAM.chi_sq):
        bg_finder = BigramCollocationFinder.from_words(words)
        bigrams = bg_finder.nbest(score_fn, 100000)
        return dict((bg, True) for bg in chain(words, bigrams))
  • The NLTK Naïve Bayes Classifier
    from nltk.classify import NaiveBayesClassifier

    neg_examples = [(features(reviews.words(i)), 'neg') for i in neg_ids]
    pos_examples = [(features(reviews.words(i)), 'pos') for i in pos_ids]
    train_set = pos_examples + neg_examples
    classifier = NaiveBayesClassifier.train(train_set)
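    To measure accuracy figures like the ones quoted later in the slides, a held-out split can be scored with NLTK's accuracy helper; the 75/25 split below is an arbitrary illustrative choice:
      from nltk.classify.util import accuracy

      # hold out the last 25% of each class for testing
      train_set = pos_examples[:750] + neg_examples[:750]
      test_set = pos_examples[750:] + neg_examples[750:]
      classifier = NaiveBayesClassifier.train(train_set)
      print accuracy(classifier, test_set)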
  • Most informative features
    >>> classifier.show_most_informative_features()
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
    insulting = True      neg : pos = 13.0 : 1.0
    vulnerable = True     pos : neg = 12.3 : 1.0
    ludicrous = True      neg : pos = 11.8 : 1.0
    avoids = True         pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
    astounding = True     pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
    idiotic = True        neg : pos = 9.8 : 1.0
  • Training NLTK classifiers
    Try nltk-trainer:
      python train_classifier.py --instances paras \
          --classifier NaiveBayes --bigrams \
          --min_score 3 movie_reviews
  • REST services
  • NLTK Online demos
  • NLTK REST APIs
    % curl -d "text=Inception is the best movie ever" \
        http://text-processing.com/api/sentiment/
    {
      "probability": {
        "neg": 0.36647424288117808,
        "pos": 0.63352575711882186
      },
      "label": "pos"
    }
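    The same request from Python 2 (the Python version used elsewhere in the slides), standard library only; the endpoint URL and the text form field are taken from the curl example above:
      import json, urllib, urllib2

      data = urllib.urlencode({'text': 'Inception is the best movie ever'})
      response = urllib2.urlopen('http://text-processing.com/api/sentiment/', data)
      print json.loads(response.read())['label']     # e.g. 'pos'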
  • Google Prediction API
  • Typical performance results: movie reviews
    nltk: unigram occurrences + Naïve Bayesian Classifier ~ 70%
    Google Prediction API ~ 83%
    scikit-learn: TF-IDF unigram features + LinearSVC ~ 87%
    nltk: collocation feature selection + Naïve Bayesian Classifier ~ 97%
  • Typical results: newsgroups topics classification
    20 newsgroups dataset: ~ 19K short text documents, 20 categories
    By-date train / test split
    Bigram TF-IDF + LinearSVC: ~ 87%
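    A sketch of how a score in that range could be reproduced with the current scikit-learn API (fetch_20newsgroups, TfidfVectorizer and its ngram_range parameter are assumptions relative to the 0.7-era code shown earlier):
      from sklearn.datasets import fetch_20newsgroups
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.metrics import accuracy_score

      train = fetch_20newsgroups(subset='train')    # by-date train split
      test = fetch_20newsgroups(subset='test')      # by-date test split
      vec = TfidfVectorizer(ngram_range=(1, 2))     # unigram + bigram TF-IDF features
      clf = LinearSVC().fit(vec.fit_transform(train.data), train.target)
      predicted = clf.predict(vec.transform(test.data))
      print(accuracy_score(test.target, predicted))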
  • Confusion Matrix (20 newsgroups), category legend:
    00 alt.atheism, 01 comp.graphics, 02 comp.os.ms-windows.misc,
    03 comp.sys.ibm.pc.hardware, 04 comp.sys.mac.hardware, 05 comp.windows.x,
    06 misc.forsale, 07 rec.autos, 08 rec.motorcycles, 09 rec.sport.baseball,
    10 rec.sport.hockey, 11 sci.crypt, 12 sci.electronics, 13 sci.med, 14 sci.space,
    15 soc.religion.christian, 16 talk.politics.guns, 17 talk.politics.mideast,
    18 talk.politics.misc, 19 talk.religion.misc
  • Typical results: Language Identification
    15 Wikipedia articles
    [p.text_content() for p in html_tree.findall('//p')]
    CharNGramAnalyzer(min_n=1, max_n=3)
    TF-IDF
    LinearSVC
  • Typical results: Language Identification
  • Scaling to many possible outcomes
    Example: possible outcomes are all the categories of Wikipedia (565,108)
    From Document Categorization to Information Retrieval
    Fulltext index for TF-IDF similarity queries
    Smart way to find the top 30 search keywords
    Use Apache Lucene / Solr MoreLikeThisQuery
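    A toy pure-Python sketch of the underlying idea, TF-IDF similarity queries against an index of sparse vectors; this only illustrates the concept, not the Lucene / Solr MoreLikeThisQuery implementation:
      def cosine(a, b):
          # cosine similarity between two sparse {term: tfidf_weight} dicts
          dot = sum(w * b.get(t, 0.0) for t, w in a.items())
          norm_a = sum(w * w for w in a.values()) ** 0.5
          norm_b = sum(w * w for w in b.values()) ** 0.5
          return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

      def most_similar(query_vec, index, top_n=30):
          # rank indexed documents (e.g. Wikipedia category profiles) by similarity
          ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
          return ranked[:top_n]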
  • Some pointers
    http://scikit-learn.sf.net (doc & examples)
    http://github.com/scikit-learn (code)
    http://www.nltk.org (code & doc & PDF book)
    http://streamhacker.com/ (Jacob Perkins' blog on NLTK & APIs)
    https://github.com/japerk/nltk-trainer
    http://www.slideshare.net/ogrisel (these slides)
    http://twitter.com/ogrisel / http://github.com/ogrisel
    Questions?