Corpus Bootstrapping with NLTK by Jacob Perkins

Upload: jacob-perkins

Post on 01-Nov-2014

DESCRIPTION

Presented at Strata 2012 Deep Data session.

TRANSCRIPT

Page 1: Corpus Bootstrapping with NLTK

Corpus Bootstrapping with NLTK by Jacob Perkins

Page 2: Corpus Bootstrapping with NLTK

Jacob Perkins

http://www.weotta.com

http://streamhacker.com

http://text-processing.com

https://github.com/japerk/nltk-trainer

@japerk

Page 3: Corpus Bootstrapping with NLTK

Problem

you want to do NLProc

many proven supervised training algorithms

but you don’t have a training corpus

Page 4: Corpus Bootstrapping with NLTK

Solution

make a custom training corpus

Page 5: Corpus Bootstrapping with NLTK

Problems with Manual Annotation

takes time

requires expertise

expert time costs $$$

Page 6: Corpus Bootstrapping with NLTK

Solution: Bootstrap

less time

less expertise

costs less

requires thinking & creativity

Page 7: Corpus Bootstrapping with NLTK

Corpus Bootstrapping at Weotta

review sentiment

keyword classification

phrase extraction & classification

Page 8: Corpus Bootstrapping with NLTK

Bootstrapping Examples

english -> spanish sentiment

phrase extraction

Page 9: Corpus Bootstrapping with NLTK

Translating Sentiment

start with english sentiment corpus & classifier

english -> spanish -> spanish

Page 10: Corpus Bootstrapping with NLTK

English -> Spanish -> Spanish

1. translate english examples to spanish

2. train classifier

3. classify spanish text into new corpus

4. correct new corpus

5. retrain classifier

6. add to corpus & goto 4 until done
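The loop in steps 2 through 6 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `train`, `classify`, and `bootstrap` are hypothetical names, the word-count "classifier" is a toy stand-in for a real one, and `correct` is a callback standing in for the human who fixes wrong labels.

```python
from collections import Counter


def train(corpus):
    """Toy trainer (assumption): count words seen under each label."""
    counts = {}
    for text, label in corpus:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def classify(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get)


def bootstrap(seed_corpus, unlabeled_batches, correct):
    """Train, classify a batch, correct it, add it to the corpus, retrain."""
    corpus = list(seed_corpus)
    model = train(corpus)                                    # step 2
    for batch in unlabeled_batches:
        guessed = [(t, classify(model, t)) for t in batch]   # step 3
        corpus.extend(correct(guessed))                      # steps 4 and 6
        model = train(corpus)                                # step 5
    return corpus, model


# Usage: the seed plays the role of the translated examples from step 1.
seed = [("buena pelicula", "pos"), ("mala pelicula", "neg")]
corpus, model = bootstrap(seed, [["muy buena", "muy mala"]], lambda g: g)
```

Passing an identity function as `correct` skips the manual review, so any wrong guess would be fed straight back into training; a real corrector is where the human judgement goes.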

Page 11: Corpus Bootstrapping with NLTK

Translate Corpus

$ translate_corpus.py movie_reviews --source english --target spanish

Page 12: Corpus Bootstrapping with NLTK

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

Page 13: Corpus Bootstrapping with NLTK

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

Page 14: Corpus Bootstrapping with NLTK

Manual Correction

1. scan each file

2. move incorrect examples to correct file

Page 15: Corpus Bootstrapping with NLTK

Train New Classifier

$ train_classifier.py spanish_sentiment

Page 16: Corpus Bootstrapping with NLTK

Adding to the Corpus

start with >90% probability

retrain

carefully decrease probability threshold
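A rough sketch of the threshold idea, assuming a classifier object that exposes `prob_classify()` the way NLTK classifiers do (returning a distribution with `max()` and `prob()`). The function name `confident_examples` and the stub classes are hypothetical and exist only so the example runs without a trained model. Whether the cutoff should be inclusive is a choice made here, not something the slides specify.

```python
def confident_examples(classifier, featuresets, threshold=0.9):
    """Yield (index, label, probability) for predictions at or above threshold."""
    for i, feats in enumerate(featuresets):
        dist = classifier.prob_classify(feats)
        label = dist.max()
        p = dist.prob(label)
        if p >= threshold:
            yield i, label, p


class StubDist:
    """Hypothetical stand-in for a probability distribution."""

    def __init__(self, probs):
        self._probs = probs

    def max(self):
        return max(self._probs, key=self._probs.get)

    def prob(self, label):
        return self._probs.get(label, 0.0)


class StubClassifier:
    """Hypothetical classifier: treats each featureset as its own probabilities."""

    def prob_classify(self, feats):
        return StubDist(feats)


examples = [{"pos": 0.95, "neg": 0.05}, {"pos": 0.6, "neg": 0.4}]
kept = list(confident_examples(StubClassifier(), examples, threshold=0.9))
```

Only the first example survives at 0.9; lowering `threshold` to 0.6 would admit the second as well, which is why each decrease deserves a fresh round of correction.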

Page 17: Corpus Bootstrapping with NLTK

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt

Page 18: Corpus Bootstrapping with NLTK

When are you done?

what level of accuracy do you need?

does your corpus reflect real text?

how much time do you have?

Page 19: Corpus Bootstrapping with NLTK

Tips

garbage in, garbage out

correct bad data

clean & scrub text

experiment with train_classifier.py options

create custom features
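As one possible illustration of "clean & scrub text" and "create custom features", the sketch below strips HTML-like tags, normalizes whitespace and case, and builds a bag-of-words dict of the kind NLTK classifiers accept as a featureset. The function names `clean` and `bag_of_words` and the specific cleaning rules are assumptions, not what `train_classifier.py` does internally.

```python
import re


def clean(text):
    """Drop HTML-like tags, collapse whitespace, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def bag_of_words(text):
    """Map each word in the cleaned text to True."""
    return {word: True for word in re.findall(r"\w+", clean(text))}


features = bag_of_words("<p>Great, great film!</p>")
```

Here `features` holds one entry each for `great` and `film`; richer features (bigrams, negation markers, and so on) would slot into the same dict.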

Page 20: Corpus Bootstrapping with NLTK

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus

2. annotate raw text

3. train pos tagger

4. create pos tagged & chunked corpus

5. tag unknown words

6. train pos tagger & chunker

7. correct errors

8. add to corpus, goto 5 until done

Page 21: Corpus Bootstrapping with NLTK

NLTK Tagged Corpora

English: brown, conll2000, treebank

Portuguese: mac_morpho, floresta

Spanish: cess_esp, conll2002

Catalan: cess_cat

Dutch: alpino, conll2002

Indian Languages: indian

Chinese: sinica_treebank

see http://text-processing.com/demo/tag/

Page 22: Corpus Bootstrapping with NLTK

Train Tagger

$ train_tagger.py treebank --simplify_tags

Page 23: Corpus Bootstrapping with NLTK

Phrase Annotation

Hello world, [this is an important phrase].

Page 24: Corpus Bootstrapping with NLTK

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

Page 25: Corpus Bootstrapping with NLTK

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
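To make the format above concrete, here is a small standard-library sketch that reads one such line back into tagged tokens and bracketed phrases. `parse_chunked` is a hypothetical helper written for illustration and assumes well-formed, space-separated input; it is not how NLTK's corpus readers are implemented.

```python
def parse_chunked(line):
    """Return (tagged_tokens, phrases) from a 'word/TAG [ word/TAG ]' line."""
    tagged, phrases, current = [], [], None
    for item in line.split():
        if item == "[":
            current = []                  # start collecting a phrase
        elif item == "]":
            phrases.append(current)       # close the current phrase
            current = None
        else:
            # split on the last slash so ',/,' and './.' parse correctly
            word, _, tag = item.rpartition("/")
            tagged.append((word, tag))
            if current is not None:
                current.append((word, tag))
    return tagged, phrases


line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tagged, phrases = parse_chunked(line)
phrase_text = " ".join(word for word, _ in phrases[0])
```

With the example line from the slide, `phrase_text` is the original annotated phrase with its tags stripped.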

Page 26: Corpus Bootstrapping with NLTK

Correct Unknown Words

1. find -NONE- tagged words

2. fix tags
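Finding the words to fix can be partly automated. The sketch below scans a tagged line for tokens whose tag is `-NONE-`; `find_unknown` is a hypothetical helper, and it assumes the same space-separated `word/TAG` layout shown on the previous slide.

```python
def find_unknown(tagged_line, unknown_tag="-NONE-"):
    """Return the words in a 'word/TAG' line whose tag is unknown_tag."""
    found = []
    for item in tagged_line.split():
        word, sep, tag = item.rpartition("/")
        # bracket markers have no slash, so sep is empty and they are skipped
        if sep and tag == unknown_tag:
            found.append(word)
    return found


unknown = find_unknown("Hello/N Weotta/-NONE- [ rocks/V ] ./.")
```

Running this over every file in the corpus gives a worklist of words that still need a hand-assigned tag.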

Page 27: Corpus Bootstrapping with NLTK

Train New Tagger

$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Page 28: Corpus Bootstrapping with NLTK

Train Chunker

$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Page 29: Corpus Bootstrapping with NLTK

Extracting Phrases

import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger and chunker trained in the previous steps
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # map each chunk label to the list of phrases found under it
    d = collections.defaultdict(list)
    # NLTK 3 replaced Tree.node with Tree.label()
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw input string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})

Page 30: Corpus Bootstrapping with NLTK

Final Tips

error correction is faster than manual annotation

find close enough corpora

use nltk-trainer to experiment

iterate -> quality

no substitute for human judgement