NLTK: The Good, the Bad, and the Awesome

Uploaded by jacob-perkins, posted 09-May-2015


DESCRIPTION

Presented at the first meeting of the Bay Area NLP group: http://www.meetup.com/Bay-Area-NLP/events/16522295/ by Jacob Perkins.

TRANSCRIPT

Page 1: NLTK: the Good, the Bad, and the Awesome

NLTK: The Good, the Bad, and the Awesome

Page 2:

Jacob Perkins

• Python Text Processing with NLTK 2.0 Cookbook

• streamhacker.com

• weotta.com

• text-processing.com

• @japerk

Page 3:

The Good

• Makes NLProc easier and more accessible

• Python (great learning language)

• Lots of documentation (and 2 books!)

• Designed for training custom models

• Includes many training corpora

• Many algorithms to experiment with
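The "designed for training custom models" point is easy to see in code. A minimal sketch of training an NLTK classifier; the feature dicts below are hand-made stand-ins, not features extracted from a real corpus:

```python
from nltk.classify import NaiveBayesClassifier

# Toy feature sets standing in for features extracted from a corpus.
train_set = [
    ({'contains(great)': True}, 'pos'),
    ({'contains(excellent)': True}, 'pos'),
    ({'contains(awful)': True}, 'neg'),
    ({'contains(boring)': True}, 'neg'),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify({'contains(great)': True}))
```

Swapping in a different algorithm (e.g. a decision tree or maxent classifier) is a one-line change, which is what makes experimentation cheap.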

Page 4:

The Bad

• NLProc is hard

• Few out-of-the-box solutions (see Pattern)

• Not designed for big-data (see Mahout)

• Doesn’t have latest algorithms (see Scikits-Learn)

• No online or active learning algorithms

Page 5:

More Bad

• Doesn’t play nice with pip or easy_install

• Python (Java: StanfordNLP, OpenNLP, Gate, Mahout)

• Models can use a lot of memory (& disk if pickled)

Page 6:

The Awesome

• Great for education and research

• Lots of users & active community

• Extensible interfaces

• Training algorithms span human languages

Page 7:

More Awesome

• Trained models can be very fast

• Well known algorithms can be very accurate

• NLTK-Trainer (train models with 0 code)

• Corpus bootstrapping

Page 8:

Some Numbers

• 3 Classification Algorithms

• 9 Part-of-Speech Tagging Algorithms

• Stemming Algorithms for 15 Languages

• 5 Word Tokenization Algorithms

• Sentence Tokenizers for 16 Languages

• 60 included corpora
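Some of those numbers can be poked at directly from the interpreter. For example, the Snowball stemmer family reports which languages it covers (the exact count varies by NLTK version):

```python
from nltk.stem.snowball import SnowballStemmer

# SnowballStemmer exposes its supported languages as a class attribute.
print(len(SnowballStemmer.languages))

stemmer = SnowballStemmer('english')
print(stemmer.stem('running'))  # 'run'
```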

Page 9:

Text-Processing.com

• NLTK Demos & APIs

• Sentiment Analysis

• Part-of-Speech Tagging & Chunking / NER

• Stemming

• Tokenization
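The demos above are backed by a plain HTTP API. A hedged sketch of calling the sentiment endpoint, assuming the site's documented /api/sentiment/ form-POST interface; the `urlopen` parameter is injectable only so the call can be stubbed out without network access:

```python
import json
from urllib import parse, request

API_URL = 'http://text-processing.com/api/sentiment/'

def classify_sentiment(text, urlopen=request.urlopen):
    """POST text to the sentiment endpoint and return the parsed JSON reply."""
    data = parse.urlencode({'text': text}).encode('utf-8')
    with urlopen(API_URL, data) as resp:
        return json.loads(resp.read().decode('utf-8'))

# Requires network access:
# classify_sentiment('great movie!')
```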

Page 10:

Memory Usage (chart; text-processing.com)

Page 11:

CPU Usage (chart; text-processing.com)

Page 12:

NLTK-Trainer

• https://github.com/japerk/nltk-trainer

• 3 Training Command Scripts

‣ train_classifier.py

‣ train_tagger.py

‣ train_chunker.py

• Easy to tweak training parameters

• Duck-Typed corpus reading

Page 13:

Training Classifiers

• train_classifier.py movie_reviews --instances paras

• train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2

• train_classifier.py movie_reviews --instances paras --classifier MEGAM

• train_classifier.py movie_reviews --instances paras --cross-fold 10

• Pickled models are saved in ~/nltk_data/classifiers/
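Loading one of those saved models back is plain pickle. The filename below is only a plausible example; the actual name depends on the corpus and classifier options used:

```python
import os
import pickle

# Hypothetical path; nltk-trainer names the pickle after the corpus and options.
model_path = os.path.expanduser(
    '~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle')

def load_model(path):
    """Unpickle a trained model from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Once loaded, it behaves like any NLTK classifier:
# classifier = load_model(model_path)
# classifier.classify({'contains(great)': True})
```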

Page 14:

Training Taggers

• train_tagger.py treebank

• train_tagger.py treebank --sequential ubt --brill

• train_tagger.py treebank --sequential '' --classifier NaiveBayes

• train_tagger.py mac_morpho --simplify_tags

• Pickled models are saved in ~/nltk_data/taggers/

Page 15:

Training Chunkers

• train_chunker.py treebank_chunk

• train_chunker.py treebank_chunk --classifier NaiveBayes

• train_chunker.py conll2000 --fileids train.txt

• Pickled models are saved in ~/nltk_data/chunkers/

Page 16:

Corpus Bootstrapping

• Guess & correct is easier than starting from scratch

• Use an existing model for initial guesses

• emoticons

‣ :) = “pos”

‣ :( = “neg”

• ratings

‣ 5 stars = “pos”

‣ 1 star = “neg”
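The emoticon and rating heuristics above amount to a small labeling function for making initial guesses. A pure-Python sketch (names are illustrative):

```python
def bootstrap_label(text, stars=None):
    """Guess an initial sentiment label from a star rating or emoticons.

    Returns 'pos', 'neg', or None when no heuristic applies.
    """
    if stars == 5:
        return 'pos'
    if stars == 1:
        return 'neg'
    if ':)' in text:
        return 'pos'
    if ':(' in text:
        return 'neg'
    return None  # leave for manual correction or a model-based guess

guesses = [(t, bootstrap_label(t)) for t in ['love it :)', 'broke in a day :(']]
```

The guesses then get manually corrected, which is much faster than annotating every example from scratch.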

Page 17:

Portuguese Phrase Extraction & Classification

• similar to condensr.com

• Brazilian Portuguese

• aspect classification is easy with training corpus

• need chunked corpus for phrase extraction

• use mac_morpho & nltk-trainer to train initial tagger

• part-of-speech tag annotation is time consuming

• simplified tags are much easier

• bracketed phrases w/out pos tags

Page 18:

treebank_chunk

[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
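NLTK can split word/TAG tokens like those back into (word, tag) pairs with nltk.tag.str2tuple:

```python
from nltk.tag import str2tuple

# Each word/TAG token from the treebank_chunk example splits on the last '/'.
tokens = 'Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ'.split()
tagged = [str2tuple(t) for t in tokens]
print(tagged[0])  # ('Pierre', 'NNP')
```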

Page 19:

Just Brackets

[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
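Reading that bracket-only annotation back into phrase spans needs nothing more than a regex; a minimal sketch:

```python
import re

sentence = ('[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as '
            '[ a nonexecutive director Nov. 29 ] .')

# Bracketed spans mark the annotated phrases; everything else is unchunked.
phrases = re.findall(r'\[ (.*?) \]', sentence)
print(phrases)
# ['Pierre Vinken', '61 years', 'the board', 'a nonexecutive director Nov. 29']
```

Dropping the POS tags and keeping only the brackets is what makes this annotation style fast for humans to produce.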

Page 20:

NLP at Weotta

• Parsing & information extraction

• Text cleaning & normalization (more parsing)

• Text & keyword classification

• De-duplication

• Search indexing / IR

• Sentiment analysis

• Human integration