Accessing files with NLTK Regular Expressions

Upload: noel

Post on 25-Feb-2016

34 views

Category:

Documents


0 download

DESCRIPTION

Accessing files with NLTK Regular Expressions. Accessing additional files. Python has tools for accessing files from the local directories and also for obtaining files from the web. Python module for web access. urllib2 Note – this is for Python 2.x, not Python 3 - PowerPoint PPT Presentation

TRANSCRIPT

Accessing files with NLTK Regular Expressions

Accessing additional files
Python has tools for accessing files from local directories and also for obtaining files from the web.
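For reading local files, the built-in open() function is all that is needed. A minimal sketch (the filename is illustrative):

>>> f = open('document.txt')   # a file in the current directory
>>> text = f.read()            # read the entire contents as one string
>>> f.close()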

Python module for web access: urllib2
Note: this is for Python 2.x, not Python 3. Python 3 splits the urllib2 material over several modules.

import urllib2
urllib2.urlopen(url[, data][, timeout])

Establish a link with the server identified in the url and send either a GET or POST request to retrieve the page.

The optional data field provides data to send to the server as part of the request. If the data field is present, the HTTP request used is POST instead of GET. Use it to fetch content that is behind a form, perhaps a login page. If used, the data must be encoded properly for inclusion in an HTTP request; see http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1

timeout defines the time in seconds to be used for blocking operations such as the connection attempt. If it is not provided, the system-wide default value is used.
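A minimal sketch of a plain GET request with urllib2 (the URL and timeout value here are illustrative):

>>> import urllib2
>>> page = urllib2.urlopen('http://www.example.com/', timeout=10)
>>> html = page.read()   # the body of the response, as a string
>>> page.close()

Passing a data argument to urlopen would turn the request into a POST.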

Reference: http://docs.python.org/library/urllib2.html

URL fetch and use
urlopen returns a file-like object with methods:
Same as for files: read(), readline(), readlines(), fileno(), close()
New for this class:
info() returns meta-information about the document at the URL
getcode() returns the HTTP status code sent with the response (e.g. 200, 404)
geturl() returns the URL of the page, which may be different from the URL requested if the server redirected the request

Reminder: file access
Where file is the internal name of the file object:
file.close() -- the file is no longer available
file.fileno() -- returns the file descriptor; not usually needed
file.read([size]) -- read at most size bytes; if size is not specified, read to the end of the file
file.readline([size]) -- read one line; if size is provided, read at most that many bytes; an empty string is returned if EOF is encountered immediately
file.readlines([sizehint]) -- return a list of lines; if sizehint is present, return approximately that many lines, possibly rounding up to fill a buffer

URL info
info() provides the header information that HTTP returns when the HEAD request is used.
Example:
>>> print mypage.info()
Date: Mon, 12 Sep 2011 14:23:44 GMT
Server: Apache/1.3.27 (Unix)
Last-Modified: Tue, 02 Sep 2008 21:12:03 GMT
ETag: "2f0d4-215f-48bdac23"
Accept-Ranges: bytes
Content-Length: 8543
Connection: close
Content-Type: text/html

URL status and code
>>> print mypage.getcode()
200

>>> print mypage.geturl()
http://www.csc.villanova.edu/~cassel/
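Since the object returned by urlopen is file-like, the familiar file methods also work on it. A short sketch (the URL is illustrative):

>>> mypage = urllib2.urlopen('http://www.csc.villanova.edu/~cassel/')
>>> first = mypage.readline()    # one line of the page
>>> rest = mypage.readlines()    # the remaining lines, as a list
>>> mypage.close()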

Messy HTML
HTML is not always perfect. Browsers may be forgiving. Human and computerized HTML generators make mistakes. Tools for dealing with imperfect HTML include Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/
"Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it 'Find all the links', or 'Find all the links of class externalLink', or 'Find all the links whose urls match foo.com', or 'Find the table heading that's got bold text, then give me that text.'"
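A minimal sketch of that style of use, assuming the bs4 (Beautiful Soup 4) package is installed; the HTML string is illustrative:

>>> from bs4 import BeautifulSoup
>>> html = '<html><body><a href="http://foo.com/a">one</a><a class="externalLink" href="http://bar.com/b">two</a></body></html>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all('a')                           # find all the links
>>> soup.find_all('a', class_='externalLink')    # only the links of class externalLink
>>> [a['href'] for a in soup.find_all('a')]      # just the URLs
['http://foo.com/a', 'http://bar.com/b']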

Exceptions: How to Deal with Error Situations
number = 0
while not 1 ...

re.split examples
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']

>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

\w = word class: equivalent to [a-zA-Z0-9_]
\W = complement of \w: all characters other than letters, digits, and the underscore

re.findall(pattern, string[, flags])
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found.

Applications of re
Extract word pieces

Another example:
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
vu50390:ch3 lcassel$ python re2.py
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ('oe', 15), ('iu', 14), ('ae', 11), ('eau', 10), ('uo', 8), ('ao', 6), ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4), ('aa', 3), ('ieu', 3), ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1), ('eea', 1), ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1), ('uu', 1)]

Spot check
Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

[int(n) for n in re.findall(?, '2009-12-31')]

Processing some text
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

Noting redundancy in English, this eliminates word-internal vowels.

Tabulating combinations
>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

Inspecting the words behind the numbers
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                  for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']

Stemming
A simple approach:
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word

Building a stemmer
Build a disjunction of all suffixes:
r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$'

Take a look. What do we have here?
r -- raw string: interpret everything just as you see it
^ -- from the beginning
. -- match anything
* -- repeat the "match anything" 0 or more times
(ing|ly|ed|ious|ies|ive|es|s|ment) -- look for one of these
$ -- at the end of the string
'processing' -- the string to process
The result:

>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

To get the whole word
We need to add ?: to make the group non-capturing:
>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

Split the word into stem and suffix
Some subtleties are involved.
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
Looks OK, but:
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]
The * is a greedy operator: it takes as much as it can get.
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
*? is the non-greedy version.
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
The trailing ? makes the suffix list optional, so the pattern also matches when no suffix is present.

A stemming function
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
Note the strange words returned as stems: basi from basis, and deriv, execut, and so on.

The Porter Stemmer
Official home: http://tartarus.org/martin/PorterStemmer/index-old.html
The Python version: http://tartarus.org/martin/PorterStemmer/python.txt

Searching tokenized text
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave

>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")
you rule bro; telling you bro; u twizted bro

>>> chat.findall(r"{3,}") lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; lala la; lovely lol lol love; lol lol lol.; la la la; la la la( ) means only that part is returnedre.showCo{l}or{l}ess green ideas s{l}eep furious{l}yColorless {gree}n ideas sleep furiouslyimport nltk, resent = "Colorless green ideas sleep furiously"nltk.re_show('l',sent)

>>> nltk.re_show('gree', sent)
Colorless {gree}n ideas sleep furiously

Word patterns
>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals

Spot check
How would you find all instances of the pattern "as x as y"?
Example: as easy as pie
Can you handle this: as pretty as a picture?

More on stemming
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']

>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
The lemmatizer only removes affixes if the resulting word is in its dictionary.

Tokenizing
We have used split before, but it was not very complete. The built-in re abbreviation for any kind of whitespace is \s.
>>> re.split(r'\s+', raw)
['Dennis:', 'Listen,', 'strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses,', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony.']

Tokenizing
Split on anything other than a word character (A-Za-z0-9_):
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']
Note: "I'M" became 'I', 'M'.

>>> re.findall(r'\w+', raw)
This splits on the words, instead of on the separators.

The pattern \w+|\S\w* will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

Getting there
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']
Still to do: handle the internal marks in 'M and 't.

Regular expression symbols, summary:
Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character

Tokenizer in Python
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Spot check
Describe the class of strings matched by the following regular expressions:

1. [a-zA-Z]+
2. [A-Z][a-z]*
3. p[aeiou]{,2}t
4. \d+(\.\d+)?
5. ([^aeiou][aeiou][^aeiou])*
6. \w+|[^\w\s]+

Test your answers using nltk.re_show().
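For example, the fourth pattern can be checked like this (the test string is illustrative):

>>> nltk.re_show(r'\d+(\.\d+)?', 'Pi is about 3.14159; prices were 12 and 12.40')
Pi is about {3.14159}; prices were {12} and {12.40}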

Exercise for next week
Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?
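A minimal sketch of one possible approach (the corpus choice and the pattern are only illustrative, and the pattern over-generates slightly, e.g. it also matches words such as "white"):

>>> import nltk, re
>>> raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')
>>> tokens = nltk.word_tokenize(raw)
>>> sorted(set(t for t in tokens if re.match(r'(?i)wh\w+$', t)))

Because the set keeps distinct strings, case variants such as 'What' and 'what' both appear, which is exactly the duplication the question asks about.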