python 3 march15 - department of computer science · 2011-05-15 · nltk importnltk’...

Python 3

March 15, 2011

NLTK

import nltk
nltk.download()

NLTK

import nltk
from nltk.book import *

texts()

1. Look at the list of available texts

NLTK


print text1[0:50]

2. Check out what the text1 (Moby Dick) object looks like

It looks like a list of word tokens.

NLTK

3. Get a list of the most frequent word tokens


fd=FreqDist(text1)

print fd.keys()[0:10]

FreqDist is a class defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html

Give it a list of word tokens.

Its keys are automatically sorted by frequency, most frequent first, so printing the first 10 keys gives the 10 most frequent tokens.
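As a rough illustration of what a frequency distribution does, the sketch below counts tokens with collections.Counter from the standard library. Counter is only a stand-in here, not NLTK's FreqDist, and the token list is a made-up sample rather than text1. It is written in Python 3 syntax, unlike the slide code.

```python
from collections import Counter

# Made-up sample tokens (an assumption; the slides use text1 from nltk.book).
tokens = ["the", "whale", ",", "the", "sea", "and", "the", "ship", ","]

# Count how often each token occurs.
counts = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts,
# ordered from most to least frequent.
top_two = counts.most_common(2)
print(top_two)
```

Note that punctuation tokens such as "," are counted like any other token, which is why they show up near the top of the list.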

NLTK


text1.concordance("and")

4. Now get a concordance of the third most common word

concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
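To make the idea of a concordance concrete, here is a minimal sketch that shows each occurrence of a word with a few tokens of context on either side. The function simple_concordance and its window parameter are hypothetical names for illustration. This is not how NLTK implements concordance: the NLTK method measures context in characters (width=79), while this sketch counts tokens.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of `word` with `window` tokens of context per side.

    Hypothetical helper for illustration only; matching ignores case.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - window):i]       # tokens before the hit
            right = tokens[i + 1:i + 1 + window]      # tokens after the hit
            lines.append(" ".join(left + [token] + right))
    return lines


# Made-up sample tokens (an assumption; the slides use text1).
sample = "Call me Ishmael . Some years ago , never mind how long".split()
for line in simple_concordance(sample, "years", window=2):
    print(line)
```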


String Operations

5. What if you don't want punctuation in your list?

First, a simple way to fix it:

mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]

fd=FreqDist(mobyDick)
print fd.keys()[0:10]





Make a new list of tokens. Call it mobyDick.

For each token x in the original list, copy the token into the new list, except replace each "," with nothing. Then do the same for ";", ".", "'" and "-".

Then, finally, keep only the tokens with more than one character. This drops tokens that were originally “.” and are now empty, and it also drops single-character tokens.

Make a new FreqDist with the new list of tokens, and call it fd.

Print it like before.
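The string-operation steps can be tried on a small list without downloading any corpus. The sketch below is in Python 3 syntax and uses a made-up token list in place of text1. It loops over the punctuation marks instead of writing one line per mark, which is a small restructuring of the slide code rather than the slide code itself.

```python
# Made-up sample tokens (an assumption; the slides use text1).
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", ",",
          "never", "mind", "-", "a", "whale's", "tale"]

# Remove each punctuation mark from every token, one mark at a time.
cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep only tokens longer than one character. This drops the now-empty
# punctuation tokens and also single-letter words such as "a".
cleaned = [x for x in cleaned if len(x) > 1]
print(cleaned)
```

The original list is left untouched, since each list comprehension builds a new list.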

Regular Expressions

6. Now the more complicated, but less typing way:

import nltk
from nltk.book import *
import re

punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]
fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]

Import the regular expression module (import re).

Compile a regular expression. The RegEx will match any one of the characters inside the brackets.

Call the “sub” method of the compiled RegEx named punctuation, and replace anything that matches the RegEx with nothing.

As before, do this to each token in the text1 list. Call this new list punctuationRemoved.

Get a FreqDist of all tokens with length >1.

Print the top 10 word tokens as usual.

Regular expressions are really powerful and useful!
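The regular-expression version can likewise be tried on a short list. This is a sketch in Python 3 syntax with a made-up token list standing in for text1. The character class is the one from the slide, with the hyphen placed last so that it is read as a literal hyphen and not as a range.

```python
import re

# Character class: matches any single one of , . ; space ' or -
punctuation = re.compile(r"[,.; '-]")

# Made-up sample tokens (an assumption; the slides use text1).
tokens = ["well-known", "ship", ";", "it's", ".", "I", "sea"]

# Replace every match inside each token with the empty string.
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep only tokens with more than one character.
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```

One compiled pattern replaces the five separate replace calls of the string-operation version.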

Quick Diversion

7. What if you wanted to see the least common word tokens?

print fd.keys()[-10:]

Print the tokens from position -10 to the end.

Quick Diversion

8. And what if you wanted to see the frequencies with the words?

print [(k, fd[k]) for k in fd.keys()[0:10]]

For each of the first 10 keys “k” in the FreqDist, print the key together with its value (fd[k]).
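Both diversions come down to slicing a frequency-ordered list. The sketch below shows the same two ideas with collections.Counter as a stand-in for FreqDist and a made-up token list, in Python 3 syntax. The variable names are illustrative only.

```python
from collections import Counter

# Made-up sample tokens (an assumption; the slides use the cleaned Moby Dick list).
tokens = ["the", "the", "the", "whale", "whale", "sea"]
counts = Counter(tokens)

# With no argument, most_common() returns every (token, count) pair,
# ordered from most to least frequent.
ranked = counts.most_common()

# A negative slice takes items from the end: here, the two least common.
least_common = ranked[-2:]

# Pair each of the top keys with its count by looking it up, as in fd[k].
keys = [k for k, _ in ranked]
top_with_counts = [(k, counts[k]) for k in keys[0:2]]

print(least_common)
print(top_with_counts)
```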

Back to Regular Expressions

9. Another simple example

import re

myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

colorsRegEx=re.compile("blue|red|green")

print colorsRegEx.sub("color",myString)


This looks similar to the RegEx that matched punctuation before.

This RegEx matches the substring “blue” or the substring “red” or the substring “green”.

Here, substitute anything that matches the RegEx with the string “color”.
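A short Python 3 sketch of the color substitution follows, using only the first sentence of the example string. Because the pattern matches substrings, it also fires inside longer words. The second pattern, with \b word boundaries, is an addition for illustration and does not appear in the slides.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# Alternation: match "blue" or "red" or "green" anywhere in the string.
colors_regex = re.compile("blue|red|green")
result = colors_regex.sub("color", my_string)
print(result)

# The pattern matches substrings, so "red" inside "bored" is replaced too.
surprise = colors_regex.sub("color", "I am bored")

# Assumption for illustration: \b restricts matches to whole words.
whole_words = re.compile(r"\b(?:blue|red|green)\b")
unchanged = whole_words.sub("color", "I am bored")
```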


10. A more interesting example

What if we wanted to identify all of the phone numbers in the string?


import re

phoneNumbersRegEx=re.compile(r'\d{11}')

print phoneNumbersRegEx.findall(myString)

Note that \d matches a digit, and {11} matches 11 of them in a row.

This is a start. Output: ['18005551234']


findall will return a list of all non-overlapping substrings of myString that match the RegEx.
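The following Python 3 sketch runs findall on the full example string with two digit-count patterns. The \d{10} pattern is an addition for comparison. It shows why counting digits alone is not enough: it finds the plain 10-digit number, but it also cuts the 11-digit number short.

```python
import re

my_string = (
    "I have red shoes and blue pants and a green shirt. "
    "My phone number is 8005551234 and my friend's phone number is "
    "(800)-565-7568 and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

# Exactly 11 digits in a row.
eleven_digits = re.compile(r"\d{11}")
# Exactly 10 digits in a row (added here for comparison).
ten_digits = re.compile(r"\d{10}")

eleven_matches = eleven_digits.findall(my_string)
ten_matches = ten_digits.findall(my_string)

print(eleven_matches)
print(ten_matches)
```

Neither pattern finds the two numbers written with hyphens or parentheses, which is what the optional elements in the next step address.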


You will also need to know:

“?” will match 0 or 1 repetitions of the previous element

Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
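As a small illustration of the “?” quantifier, the sketch below makes a hyphen optional between two groups of digits. The pattern and the sample text are made up for this example, in Python 3 syntax.

```python
import re

# "-?" means the hyphen may appear once or not at all.
optional_hyphen = re.compile(r"\d{3}-?\d{4}")

matches = optional_hyphen.findall("555-1234 or 5551234")
print(matches)
```

The same pattern matches both spellings because “?” applies only to the element directly before it, here the hyphen.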


import re

phoneNumbersRegEx=re.compile(r'1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')

The answer is here, but let's derive it together.
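One possible way to derive the pattern is to build it up in stages and check what each stage finds in the example string. The staging below is a sketch in Python 3 syntax, not necessarily the derivation used in the lecture. Only the final pattern comes from the slide, and the variable names are illustrative.

```python
import re

my_string = (
    "I have red shoes and blue pants and a green shirt. "
    "My phone number is 8005551234 and my friend's phone number is "
    "(800)-565-7568 and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

# Stage 1: 3 digits, 3 digits, 4 digits, with optional hyphens between groups.
stage1 = re.compile(r"\d{3}-?\d{3}-?\d{4}")

# Stage 2: allow optional parentheses around the area code.
stage2 = re.compile(r"\(?\d{3}\)?-?\d{3}-?\d{4}")

# Stage 3 (the pattern from the slide): allow an optional leading 1 and hyphen.
stage3 = re.compile(r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}")

found1 = stage1.findall(my_string)
found2 = stage2.findall(my_string)
found3 = stage3.findall(my_string)

for found in (found1, found2, found3):
    print(found)
```

Each stage picks up more of the four numbers in full. Only the last stage returns all of them complete, including the leading 1.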
