Python 3, March 15 - Department of Computer Science · 2011-05-15

Python 3 March 15, 2011

NLTK

import nltk
nltk.download()

NLTK

1. Look at the list of available texts

import nltk
from nltk.book import *

texts()

NLTK

2. Check out what the text1 (Moby Dick) object looks like

import nltk
from nltk.book import *

print text1[0:50]

Looks like a list of word tokens
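
The slice text1[0:50] works the same way on any Python list. A minimal sketch, assuming a short hand-made token list in place of text1 (the sample tokens are made up for illustration, and print(...) is written with parentheses so the snippet also runs on current Python):

```python
# Hypothetical stand-in for text1: a plain list of word tokens.
tokens = ["[", "Moby", "Dick", "by", "Herman", "Melville", "1851", "]"]

# A slice [start:end] returns the items from index start up to,
# but not including, index end.
first_four = tokens[0:4]
print(first_four)
```

Punctuation marks such as "[" are tokens in their own right, which is why they show up in the frequency lists later on.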

NLTK

3. Get list of top most frequent word TOKENS

import nltk
from nltk.book import *

fd=FreqDist(text1)
print fd.keys()[0:10]

FreqDist is an object defined by NLTK: http://www.opendocs.net/nltk/0.9.5/api/nltk.probability.FreqDist-class.html

Give it a list of word tokens

Its keys will be automatically sorted, most frequent first. Print the first 10 keys
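
The idea behind FreqDist can be tried without downloading any corpus. This is a rough sketch using collections.Counter from the standard library as a stand-in, not NLTK's own class; the sample tokens are invented. Recent NLTK releases are understood to build FreqDist on Counter, so most_common should be available there too, but treat that as an assumption to check against your installed version.

```python
from collections import Counter

# Hypothetical sample tokens standing in for text1.
tokens = ["the", "whale", ",", "the", "sea", ",", "and", "the", "ship"]

# Count how many times each token occurs.
counts = Counter(tokens)

# most_common(n) returns (token, count) pairs, most frequent first.
top_two = [token for token, count in counts.most_common(2)]
print(top_two)
```

Note that the comma is counted like any other token, which motivates the punctuation clean-up in step 5.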

NLTK

4. Now get a concordance of the third most common word

import nltk
from nltk.book import *

text1.concordance("and")

concordance is a method defined for an NLTK Text: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.text.Text-class.html#concordance

concordance(self, word, width=79, lines=25)
Print a concordance for word with the specified context window.
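
To see what a concordance does, here is a simplified sketch written from scratch. The function simple_concordance and its window parameter are hypothetical names for illustration, not part of NLTK, and the real method formats its output by character width rather than by token count.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of word with up to `window` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(left + " [" + word + "] " + right)
    return lines

# Hypothetical sample sentence, split into tokens on whitespace.
tokens = "call me Ishmael and listen and learn".split()
for line in simple_concordance(tokens, "and", window=2):
    print(line)
```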

String Operations

5. What if you don't want punctuation in your list?

First, simple way to fix it:

import nltk
from nltk.book import *

mobyDick=[x.replace(",","") for x in text1]
mobyDick=[x.replace(";","") for x in mobyDick]
mobyDick=[x.replace(".","") for x in mobyDick]
mobyDick=[x.replace("'","") for x in mobyDick]
mobyDick=[x.replace("-","") for x in mobyDick]
mobyDick=[x for x in mobyDick if len(x)>1]

fd=FreqDist(mobyDick)
print fd.keys()[0:10]

Make a new list of tokens

Call it mobyDick

For each token x in the original list…

Copy the token into the new list, except replace each "," with nothing (then do the same for ";", ".", "'" and "-")

Then, finally, keep only the tokens longer than one character (this drops what was originally “.” and is now empty, and also drops single-character tokens)

Make a new FreqDist with the new list of tokens, call it fd

Print it like before
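
The same string operations can be tried on a small hand-made list. A minimal sketch, assuming invented sample tokens in place of text1; the loop over marks is only a compact way of writing the five replace lines from the slide.

```python
# Hypothetical sample tokens standing in for text1.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "-",
          "never", "mind", ",", "a", "whale", "'", "s"]

cleaned = tokens
for mark in [",", ";", ".", "'", "-"]:
    # Build a new list in which every occurrence of mark is removed.
    cleaned = [x.replace(mark, "") for x in cleaned]

# Keep only tokens longer than one character.
cleaned = [x for x in cleaned if len(x) > 1]
print(cleaned)
```

The length filter removes the now-empty punctuation tokens, but it also removes real one-letter words such as "a", which is worth keeping in mind when reading the resulting frequencies.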

Regular Expressions

6. Now the more complicated, but less typing way:

import nltk
from nltk.book import *
import re

punctuation = re.compile("[,.; '-]")
punctuationRemoved=[punctuation.sub("",x) for x in text1]
fd=FreqDist([x for x in punctuationRemoved if len(x)>1])
print fd.keys()[0:10]

Import regular expression module

Compile a regular expression

The RegEx will match any of the characters inside the brackets

Call the “sub” function associated with the RegEx named punctuation

Replace anything that matches the RegEx with nothing

As before, do this to each token in the text1 list

Call this new list punctuationRemoved

Get a FreqDist of all tokens with length >1

Print the top 10 word tokens as usual

Regular Expressions are Really Powerful and Useful!
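
The regular-expression version can also be checked on a few invented tokens. This sketch assumes the same character class as the slide; the token list and the snake_case variable names are made up for illustration.

```python
import re

# A character class: matches any single one of the listed characters.
# The hyphen is placed last so it is read literally, not as a range.
punctuation = re.compile(r"[,.; '-]")

# Hypothetical sample tokens standing in for text1.
tokens = ["well-known", "whale's", ".", ";", "sea", "I"]

# sub(replacement, string) replaces every match with the replacement.
punctuation_removed = [punctuation.sub("", x) for x in tokens]

# Keep only tokens longer than one character.
kept = [x for x in punctuation_removed if len(x) > 1]
print(kept)
```

One compiled pattern replaces the five separate replace calls from step 5.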

Quick Diversion

7. What if you wanted to see the least common word tokens?

import nltk
from nltk.book import *
import re

print fd.keys()[-10:]

Print the tokens from position -10 to the end

8. And what if you wanted to see the frequencies with the words?

print [(k, fd[k]) for k in fd.keys()[0:10]]

For each key “k” in the FreqDist, print it and look up its value (fd[k])
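
Both diversions come down to slicing a frequency-sorted list. A small sketch, again using collections.Counter as a stand-in for FreqDist and invented tokens:

```python
from collections import Counter

# Hypothetical sample tokens: "the" x3, "whale" x2, "sea" x1.
tokens = ["the"] * 3 + ["whale"] * 2 + ["sea"]
counts = Counter(tokens)

# Full list of (token, count) pairs, most frequent first.
ranked = counts.most_common()

# A negative start index counts from the end: the last two entries
# are the two least common tokens.
least_common = [token for token, count in ranked[-2:]]

# The first two entries, keeping the count next to each token.
top_pairs = ranked[:2]

print(least_common)
print(top_pairs)
```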

Back to Regular Expressions

9. Another simple example

import re

myString="I have red shoes and blue pants and a green shirt. My phone number is 8005551234 and my friend's phone number is (800)-565-7568 and my cell number is 1-800-123-4567. You could also call me at 18005551234 if you'd like."

colorsRegEx=re.compile("blue|red|green")
print colorsRegEx.sub("color",myString)

Looks similar to the RegEx that matched punctuation before

This RegEx matches the substring “blue” or the substring “red” or the substring “green”

Here, substitute anything that matches the RegEx with the string “color”
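
A shortened run of the color example, as a sketch. The first sentence is taken from myString; the "a hundred" string is an added illustration of what matching a substring means.

```python
import re

# "|" means "or": match "blue", or "red", or "green".
colors_regex = re.compile("blue|red|green")

my_string = "I have red shoes and blue pants and a green shirt."
result = colors_regex.sub("color", my_string)
print(result)  # I have color shoes and color pants and a color shirt.

# The pattern matches substrings, so "red" inside a longer word
# is replaced as well.
inside_word = colors_regex.sub("color", "a hundred")
print(inside_word)  # a hundcolor
```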

10. A more interesting example

What if we wanted to identify all of the phone numbers in the string?

import re

# myString is the same string as in step 9
phoneNumbersRegEx=re.compile('\d{11}')
print phoneNumbersRegEx.findall(myString)

Note that \d is a digit, and {11} matches 11 digits in a row

findall will return a list of all substrings of myString that match the RegEx

This is a start. Output: ['18005551234']

Also will need to know:

“?” will match 0 or 1 repetitions of the previous element

Note: find lots more information on regular expressions here: http://docs.python.org/library/re.html
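
The two building blocks, a repeat count in braces and the optional marker "?", can be tried on a shorter string. A minimal sketch with an invented sample sentence:

```python
import re

# Hypothetical sample text with one 10-digit and one 11-digit number.
text = "Call 8005551234 or 18005551234."

# \d{11}: exactly 11 digits in a row, so the 10-digit number is skipped.
eleven_digits = re.findall(r"\d{11}", text)
print(eleven_digits)  # ['18005551234']

# 1? makes the leading 1 optional, so both numbers now match.
optional_one = re.findall(r"1?\d{10}", text)
print(optional_one)  # ['8005551234', '18005551234']
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.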

phoneNumbersRegEx=re.compile('1?-?\(?\d{3}\)?-?\d{3}-?\d{4}')
print phoneNumbersRegEx.findall(myString)

Answer is here, but let's derive it together
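
One possible way to derive the pattern is to write it piece by piece and join the pieces. This is a sketch of that reasoning, not necessarily how it was worked through in class; the pieces list and the snake_case names are illustrative.

```python
import re

my_string = ("I have red shoes and blue pants and a green shirt. "
             "My phone number is 8005551234 and my friend's phone number is "
             "(800)-565-7568 and my cell number is 1-800-123-4567. "
             "You could also call me at 18005551234 if you'd like.")

pieces = [
    r"1?",     # optional leading 1
    r"-?",     # optional dash after it
    r"\(?",    # optional opening parenthesis (escaped: "(" is special)
    r"\d{3}",  # area code: three digits
    r"\)?",    # optional closing parenthesis
    r"-?",     # optional dash
    r"\d{3}",  # next three digits
    r"-?",     # optional dash
    r"\d{4}",  # last four digits
]
phone_regex = re.compile("".join(pieces))

matches = phone_regex.findall(my_string)
print(matches)
```

All four numbers in the string are found, whichever mix of dashes, parentheses and leading 1 they use.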