Text classification powered by Apache Mahout and Lucene
DESCRIPTION
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH, at Lucene/Solr Revolution 2013 Dublin.
Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.
TRANSCRIPT
Text classification with Apache Mahout and Lucene
Isabel Drost-Fromm
Software Engineer at Nokia Maps*
Member of the Apache Software Foundation
Co-Founder of Berlin Buzzwords and Berlin Apache Hadoop GetTogether
Co-founder of Apache Mahout
*We are hiring, talk to me or mail [email protected]
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
… provide your own success story online.
Classification?
January 8, 2008 by Pink Sherbet Photography, http://www.flickr.com/photos/pinksherbet/2177961471/
By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/
http://www.flickr.com/photos/redux/409356158/
Image by jasondevilla, http://www.flickr.com/photos/jasondv/91960897/
How a linear classifier sees data
Image by ZapTheDingbat (Light meter), http://www.flickr.com/photos/zapthedingbat/3028168415
Instance*
(sometimes also called example, item, or in databases a row)
Feature*
(sometimes also called attribute, signal, predictor, co-variate, or column in databases)
Label*
(sometimes also called class, target variable)
Image taken in Lisbon/ Portugal.
● Remove noise.
● Convert text to vectors.
Text consists of terms and phrases.
Encoding issues?
Chinese? Japanese?
“New York” vs. new York?
“go” vs. “going” vs. “went” vs. “gone”?
“go” vs. “Go”?
Terms? Tokens? Wait!
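The questions above are what an analysis chain has to answer before any vector is built. As a minimal sketch in plain Python (a stand-in for illustration, not a Lucene analyzer; the function name `tokenize` and the regular expression are assumptions), lowercasing plus splitting on non-alphanumeric characters looks like this:

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Go, going, gone: New York!")
print(tokens)
```

Lowercasing merges “go” and “Go”, but this sketch still keeps “go”, “going” and “gone” as separate terms, splits “New York” into two tokens, and ignores non-ASCII scripts such as Chinese or Japanese. Those cases need stemming, phrase detection, and language-specific tokenizers.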
Now we have terms – how to turn them into vectors?
Sunny weather
High performance computing
If we looked at two phrases only:
Aaron
Zuse
Binary bag of words
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector is one, if word occurs in text.
● Problem:
– How to know all possible terms in unknown text?
$b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$
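A minimal Python sketch of the binary encoding, assuming a fixed, known vocabulary (the function name and the example vocabulary are made up for illustration):

```python
def binary_bag_of_words(tokens, vocabulary):
    """Return one entry per vocabulary term: 1 if the term occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical vocabulary built from the two example phrases.
vocabulary = ["computing", "high", "performance", "sunny", "weather"]
vector = binary_bag_of_words(["sunny", "weather"], vocabulary)
print(vector)
```

A term that is missing from `vocabulary` is silently dropped, which is exactly the problem with unknown text noted on the slide.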
Term Frequency
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector equal to the word's frequency.
● Problem:
– Common words dominate vectors.
$b_{i,j} = n_{i,j}$
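The same idea with counts instead of ones, as a small Python sketch (function name and example sentence are assumptions for illustration):

```python
from collections import Counter

def term_frequency(tokens, vocabulary):
    """Return how often each vocabulary term occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

tokens = ["the", "weather", "is", "sunny", "the", "sun", "is", "the", "star"]
vocabulary = ["is", "star", "sun", "sunny", "the", "weather"]
vector = term_frequency(tokens, vocabulary)
print(vector)
```

In this example the largest entries belong to “the” and “is”, which illustrates how common words dominate the vector.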
TF with stop wording
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Filter stopwords.
● Entry in vector equal to the word's frequency.
● Problem:
– Common and uncommon words with same weight.
$b_{i,j} = n_{i,j}$
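A hedged sketch of term frequency with stop wording. The stopword set here is a tiny, hypothetical list chosen for the example; real analyzers ship much longer, language-specific lists.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "is", "a", "of"}

def tf_without_stopwords(tokens, vocabulary, stopwords=STOPWORDS):
    """Count vocabulary terms after dropping every token found in stopwords."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return [counts[term] for term in vocabulary]

tokens = ["the", "weather", "is", "sunny", "the", "sun", "is", "the", "star"]
vocabulary = ["star", "sun", "sunny", "weather"]
vector = tf_without_stopwords(tokens, vocabulary)
print(vector)
```

All remaining terms now carry the same weight, whether they are rare in the collection or appear in almost every document.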
TF- IDF
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Filter stopwords.
● Entry in vector equal to the weighted frequency.
● Problem:
– Long texts get larger values.
$b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$
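The weighting formula above can be sketched directly in Python. This follows the slide's formula as written (raw count times the natural log of collection size over document frequency); library implementations often add smoothing or normalisation, which this sketch does not. The function name and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({term: n * math.log(n_docs / doc_freq[term])
                        for term, n in counts.items()})
    return vectors

docs = [["sunny", "weather"],
        ["high", "performance", "computing"],
        ["sunny", "sunny", "computing"]]
weights = tf_idf(docs)
print(weights[2])
```

Because the raw count `n` is not normalised, a longer text that repeats a term more often gets a larger value, which is the problem the slide points out.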
Hashed feature vectors
● Imagine an n-dimensional space.
● Each word in texts = hashed to one dimension.
● Entry in vector set to one, if word hashed to it.
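A minimal sketch of the hashing trick in Python. It is not Mahout's encoder implementation: the CRC32 hash, the default of 16 dimensions, and the function name are assumptions picked to keep the example short and deterministic.

```python
import zlib

def hashed_vector(tokens, n_dimensions=16):
    """Set the entry a token hashes to; no vocabulary has to be known up front."""
    vector = [0] * n_dimensions
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % n_dimensions
        vector[index] = 1
    return vector

vector = hashed_vector(["high", "performance", "computing"])
print(vector)
```

Any term, including one never seen before, maps to some dimension. The price is that two different terms can collide in the same dimension, and collisions become more likely as `n_dimensions` shrinks.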
How a linear classifier sees data
[Diagram: HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream → FeatureVectorEncoder → Vector → OnlineLearner → Model]
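The path from tokens through an encoder to an online learner can be sketched in a few lines. This is a stand-in in plain Python, not Mahout's API: `encode`, `TinyOnlineLearner`, the CRC32 hash, the 1024 dimensions, the learning rate, and the two toy instances are all assumptions made for illustration. The learner is a logistic regression trained by stochastic gradient descent, one instance at a time.

```python
import math
import zlib

def encode(tokens, n_dimensions=1024):
    """Hash each token to one dimension of a fixed-size binary vector."""
    vector = [0.0] * n_dimensions
    for token in tokens:
        vector[zlib.crc32(token.encode("utf-8")) % n_dimensions] = 1.0
    return vector

class TinyOnlineLearner:
    """Logistic regression updated one instance at a time (SGD)."""

    def __init__(self, n_dimensions=1024, learning_rate=0.5):
        self.weights = [0.0] * n_dimensions
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, vector):
        """Return the estimated probability that the label is 1."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, vector))
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, label, vector):
        """Take one gradient step on a single labelled instance."""
        error = label - self.predict(vector)
        self.bias += self.learning_rate * error
        for i, x in enumerate(vector):
            if x:
                self.weights[i] += self.learning_rate * error * x

learner = TinyOnlineLearner()
examples = [(1, ["sunny", "weather"]), (0, ["high", "performance", "computing"])]
for _ in range(20):  # a few passes over the toy data
    for label, tokens in examples:
        learner.train(label, encode(tokens))
print(learner.predict(encode(["sunny", "weather"])))
```

Because the learner only ever sees one vector at a time, the full data set never has to fit in memory.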
Goals
● Did I use the best model parameters?
● How well will my model perform in the wild?
Tune model parameters,
Experiment with tokenization,
Experiment with vector encoding
Compute expected performance
Performance
● Use same data for training and testing.
● Problem:
– Highly optimistic.
– Model generalization unknown.
DON'T
Performance
● Use just a fraction for training.
● Set some data aside for testing.
● Problems:
– Pessimistic predictor: Not all data used for training.
– Result may depend on which data was set aside.
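A minimal sketch of such a hold-out split in Python. The function name, the 20% test fraction, and the fixed seed are assumptions for the example.

```python
import random

def holdout_split(instances, test_fraction=0.2, seed=42):
    """Shuffle the instances and set a fraction aside for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(10))
print(len(train), len(test))
```

Changing `seed` changes which instances land in the test set, and with it possibly the measured result.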
Performance
● Partition your data into n fractions.
● Each fraction set aside for testing in turn.
● Problem:
– Still a pessimistic predictor.
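A short sketch of n-fold partitioning in Python, assuming the instances are already in random order (the generator name is made up for illustration):

```python
def n_fold_splits(instances, n_folds):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    instances = list(instances)
    for fold in range(n_folds):
        test = instances[fold::n_folds]
        train = [x for i, x in enumerate(instances) if i % n_folds != fold]
        yield train, test

splits = list(n_fold_splits(range(6), 3))
for train, test in splits:
    print(train, test)
```

Averaging the score over all folds removes the dependence on one particular split, but each model is still trained on only part of the data.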
Performance
● Use just a fraction for training.
● Set some data aside for tuning and testing.
● Problems:
– Highly optimistic.
– Parameters manually tuned to testing data.
DON'T
Performance
● Use just a fraction for training.
● Set some data aside for tuning.
● Set another set of data aside for testing.
● Problems:
– Pretty pessimistic as not all data is used.
– May depend on which data was set aside.
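A sketch of the three-way split in Python. The function name, the 20% fractions, and the seed are assumptions for the example.

```python
import random

def three_way_split(instances, tune_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle, then cut the data into disjoint training, tuning and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_tune = int(len(shuffled) * tune_fraction)
    n_test = int(len(shuffled) * test_fraction)
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]
    train = shuffled[n_tune + n_test:]
    return train, tune, test

train, tune, test = three_way_split(range(10))
print(len(train), len(tune), len(test))
```

Parameters are chosen by comparing models on `tune` only; `test` is touched once, at the very end, to report the expected performance.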
Performance Measures
                              Correct prediction: negative    Correct prediction: positive
Model prediction: positive    false positive                  true positive
Model prediction: negative    true negative                   false negative
Accuracy
$ACC = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$
● Problems:
– What if class distribution is skewed?
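A small worked example in Python of why skew is a problem. The counts are invented for illustration: 990 negative and 10 positive instances, and a model that always predicts "negative".

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# The model never predicts "positive": no true or false positives at all.
skewed = accuracy(tp=0, fp=0, fn=10, tn=990)
print(skewed)
```

The accuracy is 0.99 even though the model does not find a single positive instance.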
Precision/ Recall
$Precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$
$Recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$
● Problem:
– Depends on decision threshold.
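The threshold dependence shows up in a few lines of Python. The scores and labels are invented toy data, and the function name is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Predict positive where score >= threshold, then compute precision and recall."""
    predictions = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.2]  # model outputs, toy data
labels = [1, 1, 0, 1, 0]            # 1 = positive, 0 = negative
strict = precision_recall(scores, labels, threshold=0.7)
lenient = precision_recall(scores, labels, threshold=0.3)
print(strict, lenient)
```

With the strict threshold, precision is perfect but one positive is missed; with the lenient one, recall is perfect but a negative slips in. Neither pair alone describes the model.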
ROC Curves
[Figure: ROC curve, built up step by step; x-axis: false orange rate, y-axis: true orange rate]
AUC – area under ROC
[Figure: area under the ROC curve; x-axis: false orange rate, y-axis: true orange rate]
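The area under the ROC curve equals the probability that a randomly chosen positive instance (an "orange") gets a higher score than a randomly chosen negative one, so it can be computed without fixing any threshold. A minimal sketch in Python on invented toy data; the pairwise loop is quadratic and only meant to make the definition visible.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.6, 0.4, 0.2]  # model outputs, toy data
labels = [1, 1, 0, 1, 0]            # 1 = orange, 0 = not an orange
area = auc(scores, labels)
print(area)
```

Here five of the six orange/non-orange pairs are ordered correctly. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.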
Photo taken by fras1977, http://www.flickr.com/photos/fras/4992313333/
Image by Medienmagazin pro, http://www.flickr.com/photos/medienmagazinpro/6266643422
http://www.flickr.com/photos/generated/943078008/
Math libs/ Mahout collections
Apache Hadoop-ready
Recommendations/Collaborative filtering
Classification/Logistic Regression/ SGD
Sequence learning/HMM
kNN and matrix factorization based Collaborative filtering
Classification/Naïve Bayes, random forest
Frequent item sets/(P)FPGrowth
Co-Location search
LDA
Clustering/ Mean shift, k-Means, Canopy, Dirichlet Process
Image by pareeerica, http://www.flickr.com/photos/pareeerica/3711741298/
Libraries to have a look at: Vowpal Wabbit, Mallet, LibSVM, LibLinear, libFM, Incanter, GraphLab, scikit-learn
Get your hands dirty: http://kaggle.com
https://cwiki.apache.org/confluence/display/MAHOUT/Collections
Where to get more information:
“Mahout in Action” - Manning
“Taming Text” - Manning
“Machine Learning” - Andrew Ng
https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks
https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading
Frameworks worth mentioning: Apache Mahout, Apache Giraph, Matlab/Octave, R, Shogun, Weka, Rapid-I, MyMediaLite
Where to meet these people: RecSys, ICML, NIPS, ECML, KDD, WSDM, PKDD, JMLR, ApacheCon, Berlin Buzzwords, O'Reilly Strata
Get started today with the right tools.
January 8, 2008 by dreizehn28, http://www.flickr.com/photos/1328/2176949559
Discuss ideas and problems online.
November 16, 2005 [phil h], http://www.flickr.com/photos/hi-phi/64055296
Discuss ideas and problems in person.
Images taken at Berlin Buzzwords 2011/12/13 by Philipp Kaden. See you there end of May 2014.
Become a committer yourself
http://BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.
Online – user/[email protected], [email protected], [email protected]
Interest in solving hard problems.
Being part of a lively community.
Engineering best practices.
Bug reports, patches, features.
Documentation, code, examples.
Image by: Patrick McEvoy