Python 3, March 15 - Department of Computer Science · 2011-05-15
Python 3
March 15, 2011
NLTK
import nltk
nltk.download()
NLTK
import nltk
from nltk.book import *
texts()
1. Look at the list of available texts
NLTK
import nltk
from nltk.book import *
print text1[0:50]
2. Check out what the text1 (Moby Dick) object looks like
Looks like a list of word tokens
NLTK
import nltk
from nltk.book import *
fd=FreqDist(text1)
print fd.keys()[0:10]
3. Get a list of the most frequent word TOKENS
FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html
Give it a list of word tokens
Its keys will be automatically sorted, most frequent first. Print the first 10 keys
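As a rough illustration of what FreqDist does with a list of tokens, the sketch below counts tokens with the standard library's collections.Counter instead of NLTK, so it runs without the corpus download. The token list and the names tokens, counts and top_tokens are made up for illustration, and the code is written for Python 3, where print is a function rather than a statement.

```python
from collections import Counter

# A tiny hand-made token list standing in for text1 (an assumption, not Moby Dick).
tokens = ["the", "whale", ",", "the", "sea", "and", "the", "ship", ","]

# Count how often each token occurs, much as FreqDist does for a token list.
counts = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs, highest count first.
top_tokens = [token for token, count in counts.most_common(2)]
print(top_tokens)
```

In recent NLTK releases FreqDist is built on Counter, so most_common should be available there too. That is likely the safer way to ask for the top tokens, since the ordering of fd.keys() may differ between NLTK versions.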
NLTK
import nltk
from nltk.book import *
text1.concordance("and")
4. Now get a concordance of the third most common word
concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance
concordance(self, word, width=79, lines=25): Print a concordance for word with the specified context window.
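To make the idea of a concordance concrete, here is a minimal pure-Python sketch that lists each occurrence of a word with a few tokens of context on either side. It is not NLTK's implementation: the function simple_concordance and its window parameter are hypothetical, and the window is counted in tokens, whereas NLTK's width is counted in characters.

```python
def simple_concordance(tokens, word, window=3, lines=25):
    """Return up to `lines` strings showing `word` with `window` tokens of context per side."""
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            # Slice out the neighbours, clamping at the start of the list.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            results.append(f"{left} [{token}] {right}".strip())
            if len(results) == lines:
                break
    return results

# Hand-made sample tokens (an assumption for illustration).
sample = "Call me Ishmael and call me again and again".split()
for line in simple_concordance(sample, "and", window=2):
    print(line)
```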
String Operations
import nltk
from nltk.book import *
mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]
fd=FreqDist(mobyDick)
print fd.keys()[0:10]
5. What if you don't want punctuation in your list?
First, simple way to fix it:
Make a new list of tokens
Call it mobyDick
For each token x in the original list…
Copy the token into the new list, except replace each , with nothing
Repeat for each of the other punctuation marks
Then, finally, keep only the tokens longer than one character (this drops what was originally “.” and is now empty, and also one-letter words such as “a” and “I”)
Make a new FreqDist with the new list of tokens, call it fd
Print it like before
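The same string-operation approach can be sketched with a loop over the punctuation marks, so the replace step is written once. This is a Python 3 sketch on a hand-made token list, not on text1; the names tokens, cleaned and moby_dick are illustrative.

```python
# Hand-made tokens standing in for text1 (an assumption for illustration).
tokens = ["Call", "me", "Ishmael", ".", "It", "'s", "a", "whale", "-", "ship", ";", "I", "say", ","]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # Same replace step as on the slide, applied once per punctuation mark.
    cleaned = [token.replace(mark, "") for token in cleaned]

# Keep tokens longer than one character, as the slide's len(x)>1 filter does.
moby_dick = [token for token in cleaned if len(token) > 1]
print(moby_dick)
```

Note that the length filter also removes "s" (left over from "'s") along with "a" and "I", which may or may not be what you want.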
Regular Expressions
import nltk
from nltk.book import *
import re
punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]
fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]
6. Now the more complicated, but less typing way:
Import regular expression module
Compile a regular expression
The RegEx will match any of the characters inside the brackets
Call the “sub” function associated with the RegEx named punctuation
Replace anything that matches the RegEx with nothing
As before, do this to each token in the text1 list
Call this new list punctuationRemoved
Get a FreqDist of all tokens with length >1
Print the top 10 word tokens as usual
Regular Expressions are Really Powerful and Useful!
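Here is a small Python 3 sketch of the regular-expression version that runs without NLTK. The token list is hand-made and the names punctuation_removed and kept are illustrative; the pattern is the one from the slide.

```python
import re

# Hand-made tokens standing in for text1 (an assumption for illustration).
tokens = ["Call", "me", "Ishmael", ".", "whale-ship", ";", "don't", ",", "a"]

# A character class: matches any single character listed inside the brackets.
punctuation = re.compile(r"[,.; '-]")

# Replace every match with the empty string, token by token.
punctuation_removed = [punctuation.sub("", token) for token in tokens]

# Drop empty and one-character tokens.
kept = [token for token in punctuation_removed if len(token) > 1]
print(kept)
```

The hyphen sits last inside the brackets on purpose: between two characters it would define a range (as in a-z), while at the end of the class it is treated as a literal hyphen.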
Quick Diversion
import nltk
from nltk.book import *
import re
print fd.keys()[-10:]
7. What if you wanted to see the least common word tokens?
Print the tokens from position -10 to the end
Quick Diversion
import nltk
from nltk.book import *
import re
print [(k, fd[k]) for k in fd.keys()[0:10]]
8. And what if you wanted to see the frequencies with the words?
For each of the first 10 keys “k” in the FreqDist, print it together with its value (fd[k])
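Both diversions, the least common tokens and the tokens paired with their counts, can be sketched with collections.Counter as a stand-in for FreqDist. The token list is made up, and the names ranked, least_common and top_with_counts are illustrative (Python 3).

```python
from collections import Counter

# Hand-made tokens with known counts (an assumption for illustration).
tokens = ["the"] * 4 + ["and"] * 3 + ["whale"] * 2 + ["ship"]
counts = Counter(tokens)

# All (token, count) pairs, most frequent first.
ranked = counts.most_common()

# Negative slicing takes items from the end: the two least common tokens.
least_common = [token for token, count in ranked[-2:]]

# The two most common tokens together with their frequencies.
top_with_counts = ranked[:2]

print(least_common)
print(top_with_counts)
```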
Back to Regular Expressions
import re
myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
colorsRegEx=re.compile("blue|red|green")
print colorsRegEx.sub("color",myString)
9. Another simple example
Looks similar to the RegEx that matched punctuation before
This RegEx matches the substring “blue” or the substring “red” or the substring “green”
Here, substitute anything that matches the RegEx with the string “color”
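A short Python 3 sketch of the colour substitution, using only the first sentence of the example string. The second call is an added illustration (not from the slides) of what "matches the substring" implies: the pattern also fires inside longer words.

```python
import re

my_string = "I have red shoes and blue pants and a green shirt."

# Alternation: match "blue" or "red" or "green" wherever they occur.
colors_regex = re.compile("blue|red|green")

# Replace every match with the word "color".
result = colors_regex.sub("color", my_string)
print(result)

# Because the pattern matches substrings, "red" inside "bored" is replaced too.
inside_word = colors_regex.sub("color", "I am bored")
print(inside_word)
```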
Back to Regular Expressions
import re
myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
phoneNumbersRegEx=re.compile('\d{11}')
print phoneNumbersRegEx.findall(myString)
10. A more interesting example
What if we wanted to identify all of the phone numbers in the string?
Note that \d is a digit, and {11} matches 11 digits in a row
findall will return a list of all non-overlapping substrings of myString that match the RegEx
This is a start. Output: ['18005551234']
Also will need to know:
“?” will match 0 or 1 repetitions of the previous element
Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
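The two building blocks mentioned here, {n} for an exact repeat count and ? for an optional element, can be tried on small strings before tackling the full example. This is a Python 3 sketch; the test strings and variable names are made up for illustration.

```python
import re

# \d{11} needs exactly 11 digits in a row, so the 10-digit number is skipped.
digits_11 = re.compile(r"\d{11}")
eleven = digits_11.findall("call 18005551234 or 8005551234")
print(eleven)

# 1? makes the leading 1 optional: both "1800" and "800" match.
optional_one = re.compile(r"1?800")
with_optional = optional_one.findall("1800 and 800")
print(with_optional)
```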
Back to Regular Expressions
import re
myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."
phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
print phoneNumbersRegEx.findall(myString)
10. A more interesting example
Answer is here, but let’s derive it together
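One possible way to derive the final pattern is to start from ten digits in a row and add one optional piece at a time, checking the output after each step. The intermediate patterns below are an assumed path, not necessarily the one taken in the lecture; only the last pattern comes from the slide. The sketch is Python 3.

```python
import re

my_string = (
    "I have red shoes and blue pants and a green shirt. "
    "My phone number is 8005551234 and my friend's phone number is (800)-565-7568 "
    "and my cell number is 1-800-123-4567. "
    "You could also call me at 18005551234 if you'd like."
)

steps = [
    r"\d{3}\d{3}\d{4}",                # ten digits in a row
    r"\d{3}-?\d{3}-?\d{4}",            # optional hyphens between the groups
    r"\(?\d{3}\)?-?\d{3}-?\d{4}",      # optional parentheses around the area code
    r"1?-?\(?\d{3}\)?-?\d{3}-?\d{4}",  # optional leading 1 and hyphen (the slide's answer)
]

# Run each intermediate pattern over the same string to see what it picks up.
results = [re.findall(pattern, my_string) for pattern in steps]
for pattern, found in zip(steps, results):
    print(pattern, found)
```

The first pattern already finds the plain 10-digit number but truncates the 11-digit one; each added optional element recovers one more format until all four numbers are matched in full.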