Practical Natural Language Processing
Catherine Havasi, Luminoso / MIT Media Lab
[Title-slide background: overlapping fragments of customer-review text — “I have always found this to be the downside”, “It was good, but there wasn’t…”, “Couldn’t understand”, “Love if I was a drunk college…”, “wet dog” — plus a news sentence: “Christine C. Quinn, the New York City Council speaker, released a video and planned to visit all five boroughs on Sunday as she officially began her campaign” and a truncated “Many social norms, like tha…”]
There are notes! luminoso.com/blog
Too much text?
Wouldn’t it be cool if we could talk to a computer?
This is hard.
It takes a lot of knowledge to understand language
I made her duck.
I made her duck
• I cooked waterfowl for her benefit (to eat)
• I cooked waterfowl belonging to her
• I created the (plaster?) duck she owns
• I made sure she got her head down
• I waved my magic wand and turned her into undifferentiated waterfowl
Language is Recursive
• You can build new concepts out of old ones indefinitely
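The recursion is easy to see with a toy grammar sketch (hypothetical rules, not from the talk): a noun phrase can contain another noun phrase, so nesting can go on indefinitely.

```python
# Toy recursive grammar: NP -> "the duck"  |  "the duck that saw" NP
# Each level of nesting builds a new concept out of an old one.
def noun_phrase(depth):
    if depth == 0:
        return "the duck"
    return "the duck that saw " + noun_phrase(depth - 1)

print(noun_phrase(0))  # the duck
print(noun_phrase(2))  # the duck that saw the duck that saw the duck
```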
Language is Creative
It was really stuffy.
Smelled really musty.
Reminds me of a dusty closet.
Was like a wet dog.
It was like it had been shut away for a long time.
Smells like an old house.
Really stale.
It smelled terrible.
A multi-lingual world
Linguistics to the rescue?
Linguistics to the rescue?
--Randall Munroe, xkcd.org/114
“Much Debate”
We just want to get things done.
So, what is state of the art?
The NLP process
• Take in a string of language
• Where are the words?
• What are the root forms of these words?
• How do the words fit together?
• Which words look important?
• What decisions should we make based on these words?
The NLP process (simplified)
• Fake understanding
The NLP process (simplified)
• Fake understanding
• Until you make understanding
Example: Detecting bad words
• You want to flag content with certain bad words in it
• Don’t just match sequences of characters
• That would lead to this classic mistake
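A minimal sketch of the classic mistake (often called the Scunthorpe problem), using a hypothetical blocklist: raw substring matching flags innocent words that happen to contain a bad word, while matching whole tokens does not.

```python
import re

BLOCKLIST = {"ass"}  # hypothetical example blocklist

def naive_flag(text):
    # Classic mistake: match raw character sequences.
    return any(bad in text.lower() for bad in BLOCKLIST)

def token_flag(text):
    # Better: break the text into tokens first, then compare whole tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in BLOCKLIST for token in tokens)

print(naive_flag("a classic assessment"))  # True -- false positive
print(token_flag("a classic assessment"))  # False
```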
Many forms of fowl language
• Suppose we want people to not say the word “duck”
Many forms of fowl language
“What the duck’s wrong with this”
“It’s all ducked up”
“Un-ducking-believable”
Step 1: break text into tokens
it’s | all | ducked | up
un | ducking | believable
Step 2: replace tokens with their root forms
it → it
’s → is
all → all
ducked → duck
up → up
un → un
ducking → duck
believable → believe
In a few lines of Python:

>>> import nltk
>>> text = "It's all ducked up. Un-ducking-believable."
>>> tokens = nltk.wordpunct_tokenize(text.lower())
>>> tokens
['it', "'", 's', 'all', 'ducked', 'up', '.', 'un', '-', 'ducking', '-', 'believable', '.']
>>> stemmer = nltk.stem.PorterStemmer()
>>> [stemmer.stem(token) for token in tokens]
['it', "'", 's', 'all', 'duck', 'up', '.', 'un', '-', 'duck', '-', 'believ', '.']
Stemmers can spell things oddly
• duck → duck
• ducking → duck
• believe → believ
• believable → believ
• happy → happi
• happiness → happi
Stemmers can mix up some words
• sincere → sincer
• sincerity → sincer
• universe → univers
• university → univers
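A toy suffix-stripper (not the actual Porter algorithm, just simplified made-up rules) shows where both behaviors come from: stripping suffixes and rewriting endings yields odd spellings like “believ” and “happi”, and unrelated words can collapse to the same stem.

```python
def toy_stem(word):
    # Strip one common suffix, if any (hypothetical, simplified rules).
    for suffix in ("ness", "ity", "able", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # Otherwise normalize the ending, Porter-style.
    if word.endswith("y"):
        return word[:-1] + "i"
    if word.endswith("e"):
        return word[:-1]
    return word

for w in ("happy", "happiness", "believe", "believable",
          "universe", "university", "sincere", "sincerity"):
    print(w, "->", toy_stem(w))
```

Note that “universe” and “university” end up identical — exactly the kind of collision a real stemmer makes.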
The NLP tool chain
• Some source of text (a database, a labeled corpus, Web scraping, Twitter...)
• Tokenizer: breaks text into word-like things
• Stemmer: finds words with the same root
• Tagger: identifies parts of speech
• Chunker: identifies key phrases
• Something that makes decisions based on these results
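The chain can be sketched end to end in plain Python (stand-in tokenizer and stemmer, hypothetical blocklist decision at the end — a real pipeline would swap in NLTK components for each stage):

```python
import re

def tokenize(text):
    # Stand-in tokenizer: lowercase word-like chunks.
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    # Stand-in stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def decide(text, blocklist):
    # Final stage: make a decision based on the processed tokens.
    return any(stem(token) in blocklist for token in tokenize(text))

print(decide("It's all ducked up", {"duck"}))  # True
```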
Useful toolkits
• NLTK (Python)
• LingPipe (Java)
• Stanford Core NLP (Java; many wrappers)
• FreeLing (C++)
The statistics of text
• Often we want to understand the differences between different categories of text
– Different genres
– Different writers
– Different forms of writing
Collecting word counts
• Start with a corpus of text
• Brown corpus (1961)
• British National Corpus (1993)
• Google Books (2009, 2012)
Collecting word counts
>>> import nltk
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> counts = Counter(brown.words())
>>> counts.most_common()[:20]
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915),
 ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011),
 ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258),
 ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]
Collecting word counts
>>> for category in brown.categories():
...     frequency = Counter(brown.words(categories=category))
...     for word in frequency:
...         frequency[word] /= counts[word] + 100.0
...     # format the results nicely
...     print("%20s -> %s" % (category,
...         ', '.join(word for word, prop in frequency.most_common()[:10])))
Prominent words by category

editorial -> Berlin, Khrushchev, East, editor, nuclear, West, Soviet, Podger, Kennedy, budget
fiction -> Kate, Winston, Scotty, Rector, Hans, Watson, Alex, Eileen, doctor, !
government -> fiscal, Rhode, Act, Government, shelter, States, tax, Island, property, shall
hobbies -> feed, clay, Hanover, site, your, design, mold, Class, Junior, Juniors
news -> Mrs., Monday, Mantle, yesterday, Dallas, Texas, Kennedy, Tuesday, jury, Palmer
religion -> God, Christ, Him, Christian, Jesus, membership, faith, sin, Church, Catholic
reviews -> music, musical, Sept., jazz, Keys, audience, singing, Newport, cholesterol
science_fiction -> Ekstrohm, Helva, Hal, B'dikkat, Mercer, Ryan, Earth, ship, Mike, Hesperus
Classifying text
• We can take text that’s categorized and figure out its word frequencies
• Wouldn’t it be more useful to look at word frequencies and figure out the category?
Example: Spam filtering
• Paul Graham’s “A Plan for Spam” (2002), which inspired SpamBayes
• Remember what e-mail was like before 2002?
• A simple classifier (Naive Bayes) changed everything
Supervised classification
• Distinguish things from other things based on examples
Applications
• Spam filtering
• Detecting important e-mails
• Topic detection
• Language detection
• Sentiment analysis
Naive Bayes
• We know the probability of various data given a category
• Estimate the probability of the category given the data
• Assume all features of the data are independent (that’s the naive part)
• It’s simple
• It’s fast
• Sometimes it even works
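The whole idea fits in a few lines. A from-scratch sketch (toy data, with Laplace-style smoothing added so unseen words don’t zero out the product — this is an illustration, not NLTK’s implementation):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.doc_counts = Counter()              # per-label document counts

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior: how common is this category overall?
            score = math.log(self.doc_counts[label] / total_docs)
            total, vocab = sum(counts.values()), len(counts)
            for token in tokens:
                # Naive part: treat every word as independent.
                # Smoothing: add 1 so unseen words don't give log(0).
                score += math.log((counts[token] + 1) / (total + vocab + 1))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("free money win prize".split(), "spam")
nb.train("win free viagra now".split(), "spam")
nb.train("meeting agenda for monday".split(), "ham")
nb.train("lunch on monday".split(), "ham")
print(nb.classify("free prize now".split()))    # spam
print(nb.classify("agenda for lunch".split()))  # ham
```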
A quick Naive Bayes experiment
• nltk.corpus.movie_reviews: movie reviews labeled as ‘pos’ or ‘neg’
• Define document_features(doc) to describe a document by the words it contains
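The slide’s document_features might look roughly like this (a guess at its shape — the original code isn’t in the transcript — using the feature-dictionary format that NLTK classifiers expect):

```python
def document_features(doc_tokens, vocabulary):
    # Describe a document by which vocabulary words it contains.
    # nltk.NaiveBayesClassifier.train takes (feature_dict, label) pairs.
    words = set(doc_tokens)
    return {"contains(%s)" % w: (w in words) for w in vocabulary}

features = document_features(["great", "plot", "boring", "acting"],
                             vocabulary=["great", "terrible", "boring"])
print(features)
# {'contains(great)': True, 'contains(terrible)': False, 'contains(boring)': True}
```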
Statistics beyond single words
• Many interesting things about text are longer than one word
• bigram: a sequence of two tokens
• collocation: a bigram that seems to be more than the sum of its parts
When is a bigram interesting?
A bigram is interesting when it occurs more often than its parts would predict if they were independent:

#(vice president) / N  ≫  (#(vice) / N) × (#(president) / N)

where N is the total number of words.
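One standard way to score this comparison is pointwise mutual information: the log ratio of the bigram’s observed probability to what the unigram probabilities predict under independence. A sketch:

```python
import math
from collections import Counter

def pmi(bigram, tokens):
    # PMI > 0 means the pair occurs more often than chance predicts.
    # Assumes the bigram occurs at least once in the token stream.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_pair = bigrams[bigram] / (n - 1)
    p_independent = (unigrams[bigram[0]] / n) * (unigrams[bigram[1]] / n)
    return math.log2(p_pair / p_independent)

tokens = "the vice president met the president".split()
print(pmi(("vice", "president"), tokens) > 0)  # True
```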
Guess the text
Guess the text
>>> from nltk.book import text4
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal Government;
General Government; American people; Vice President; Old World; Almighty God;
Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen;
Indian tribes; public debt; one another; foreign nations; political parties
Guess the text
>>> from nltk.book import text3
>>> text3.collocations()
said unto; pray thee; thou shalt; thou hast; thy seed; years old; spake unto;
thou art; LORD God; every living; God hath; begat sons; seven years;
shalt thou; little ones; living creature; creeping thing; savoury meat;
thirty years; every beast
Guess the text
>>> from nltk.book import text6
>>> text6.collocations()
BLACK KNIGHT; HEAD KNIGHT; Holy Grail; FRENCH GUARD; Sir Robin; Run away;
CARTOON CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON;
Round Table; OLD MAN; dramatic chord; dona eis; eis requiem; LEFT HEAD;
FRENCH GUARDS; music stops; Sir Launcelot
What about grammar?
• Eh
• Too hard
What about word meanings?
• “I liked the movie.”
• “I enjoyed the film.”
• These have a lot more in common than “I” and “the”.
WordNet
• A dictionary for computers
• Contains links between definitions
• Words form (roughly) a tree
Synset: good, right, ripe
Gloss (definition): most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes"
Measuring word similarity
• Various methods of measuring word similarity using paths in WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.wup_similarity(wn.synset('movie.n.1'), wn.synset('film.n.1'))
1.0
>>> wn.wup_similarity(wn.synset('cat.n.1'), wn.synset('dog.n.1'))
0.8571
>>> wn.wup_similarity(wn.synset('cat.n.1'), wn.synset('movie.n.1'))
0.3636
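Wu-Palmer similarity is 2·depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor the two synsets share. A sketch on a made-up mini-taxonomy (not WordNet’s real hierarchy, so the numbers differ from those above):

```python
def wup_similarity(path_a, path_b):
    # Depth of the least common subsumer: length of the shared prefix
    # of the two root-to-node paths.
    lcs_depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        lcs_depth += 1
    return 2 * lcs_depth / (len(path_a) + len(path_b))

# Hypothetical root-to-node paths in a toy taxonomy.
cat = ["entity", "animal", "carnivore", "feline", "cat"]
dog = ["entity", "animal", "carnivore", "canine", "dog"]
movie = ["entity", "artifact", "creation", "movie"]
print(wup_similarity(cat, dog))    # 0.6 -- they share entity/animal/carnivore
print(wup_similarity(cat, movie))  # ~0.22 -- they share only the root
```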
The black hats have WordNet too
• This is why content farms might try to tell you “What to Anticipate When You’re Anticipating”
Limitations of WordNet
>>> print(wn.wup_similarity(wn.synset('taxi.n.1'), wn.synset('driver.n.1')))
0.235294117647
>>> print(wn.wup_similarity(wn.synset('kitten.n.1'), wn.synset('adorable.a.1')))
None
ConceptNet
• More types of word relationships
• More languages
• Less precise definitions
• conceptnet5.media.mit.edu
[Diagram: a ConceptNet subgraph around “buy groceries” — nodes such as money, wallet, bank, supermarket, groceries, cashier, food, person, company, building; edges such as requires, location, is for, type of, has, takes object, has verb, related to, and person wants / doesn’t want.]
Take-away points
• NLP in general is hard
• Specific things are easy
• Find tools that work well and chain them together
• Try experimenting with NLTK
• If you need to classify things, try Naive Bayes first