On-line documents 3
Day 22 - 10/17/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction. http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
How to download a file from Project Gutenberg
Review
The global working directory
2.2.1. How to set the global working directory in Spyder
I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.
Open Spyder: On a Mac, click on the python menu (top left). In Windows, click on the Tools menu.
Open Preferences > Global working directory > Startup. Under "At startup, the global working directory is:", choose "the following directory:" and enter /Users/harryhow/Documents/pyScripts.
Set the next two selections to "the global working directory".
Leave the last option untouched and unchecked.
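The same check can also be done from code rather than through Spyder's preferences; a minimal sketch (the pyScripts path shown is the instructor's and must be adjusted to your own machine):

```python
import os

# Print the current working directory, so you can confirm where files will be saved.
print(os.getcwd())
# os.chdir('/Users/harryhow/Documents/pyScripts')  # instructor's path; adjust to yours
```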
gutenLoader
def gutenLoader(url, name):
    from urllib import urlopen
    download = urlopen(url)
    text = download.read()
    print 'text length = '+str(len(text))
    lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = text.index('\n',lineIndex)
    endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    story = text[startIndex:endIndex]
    print 'story length = '+str(len(story))
    tempFile = open(name,'w')
    tempFile.write(story.encode('utf8'))
    tempFile.close()
    print 'File saved'
    return
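The course's code is Python 2 (print statements, urlopen in the top-level urllib module). For anyone following along in Python 3, a rough equivalent is sketched below, with the header/footer slicing factored into a helper; guten_loader and extract_story are names of my own choosing, not from the course:

```python
from urllib.request import urlopen  # in Python 2 this was urllib.urlopen

START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def extract_story(text):
    # Slice out everything between the Gutenberg header and footer lines.
    line_index = text.index(START)
    start_index = text.index('\n', line_index)  # skip past the rest of the START line
    end_index = text.index(END)
    return text[start_index:end_index]

def guten_loader(url, name):
    text = urlopen(url).read().decode('utf-8')
    print('text length =', len(text))
    story = extract_story(text)
    print('story length =', len(story))
    with open(name, 'w', encoding='utf-8') as f:
        f.write(story)
    print('File saved')
```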
Usage
# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> url = 'http://www.gutenberg.org/cache/epub/32154/pg32154.txt'
>>> name = 'VariableMan.txt'
>>> gutenLoader(url, name)
Homework
Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so.
HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).
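For instance, building a Gutenberg-style URL out of a document number requires the int-to-str conversion, because integers cannot be concatenated onto strings (the document number 1 below is just an illustration):

```python
n = 1
s = str(n)                        # int -> str
assert s == '1'
assert int('32154') == 32154      # the reverse, str -> int, uses int()
# strings concatenate; using the integer n here would raise a TypeError
url = 'http://www.gutenberg.org/cache/epub/' + s + '/pg' + s + '.txt'
```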
Answer
# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> base = 28554
>>> for n in [1,2,3]:
...     num = str(base+n)
...     url = 'http://www.gutenberg.org/cache/epub/'+num+'/pg'+num+'.txt'
...     name = 'story'+str(n)+'.txt'
...     gutenLoader(url, name)
...
Get pdfminer
You only do this once ~ 2.7.3.2. How to install a package by hand
Point your web browser at https://pypi.python.org/pypi/pdfminer/.
Click on the green button to download the compressed folder.
It downloads to your Downloads folder. Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.
Open the Terminal/Command Prompt (Windows Start > Search > cmd > Command Prompt)
$> cd {drop file here}
$> python setup.py install
$> pdf2txt.py samples/simple1.pdf
Get a pdf file
To get an idea of the source: http://www.federalreserve.gov/monetarypolicy/fomchistorical2008.htm
Download the file to pyScripts:
# make sure that the current working directory is pyScripts
from urllib import urlopen
url = 'http://www.federalreserve.gov/monetarypolicy/files/FOMC20080130meeting.pdf'
download = urlopen(url)
doc = download.read()
tempFile = open('FOMC20080130.pdf','wb')  # 'wb': a PDF is binary data, not text
tempFile.write(doc)
tempFile.close()
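On Python 3 the same binary download can be sketched as below (save_pdf is an illustrative name of my own, not from the course; the key point is the 'wb' mode, since a PDF is raw bytes rather than text):

```python
from urllib.request import urlopen  # in Python 2 this was urllib.urlopen

def save_pdf(url, filename):
    # Read the raw bytes and write them out in binary mode.
    data = urlopen(url).read()
    with open(filename, 'wb') as f:
        f.write(data)
    return len(data)
```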
Run pdf2text
# make sure that Python is looking at pyScripts
>>> from corpFunctions import pdf2text
>>> text = pdf2text('FOMC20080130.pdf')
>>> len(text)
>>> text[:50]
7.3.2. How to pre-process a text with the PlaintextCorpusReader
NLTK
One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats.
Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.
PlaintextCorpusReader
The PlaintextCorpusReader needs to know two things: where your file is and what its name is.
If the current working directory is where the file is, the location argument can be left 'blank' by passing the empty string ''.
We only have one file, 'Wub.txt'. It will also prevent problems down the line to give the method an optional third argument that specifies the file's encoding: encoding='utf-8'.
Now let NLTK tokenize the text into words and punctuation.
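By default the PlaintextCorpusReader splits words from punctuation with NLTK's WordPunctTokenizer. Its behavior can be approximated with a single regular expression; the sketch below is a simplification for intuition, not NLTK's actual code:

```python
import re

def word_punct_tokenize(text):
    # Runs of word characters, or runs of non-space punctuation: \w+|[^\w\s]+
    return re.findall(r"\w+|[^\w\s]+", text)

word_punct_tokenize("The wub, it seemed, had feelings.")
```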
Usage
# make sure that the current working directory is pyScripts
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
Initial look at the text
>>> len(wubWords)
>>> wubWords[:50]
>>> set(wubWords)
>>> len(set(wubWords))
>>> wubWords.count('wub')
>>> len(wubWords) / len(set(wubWords))
>>> from __future__ import division
>>> 100 * wubWords.count('a') / len(wubWords)
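The token/type ratio and word percentages computed above can be tried on a toy token list before running them on the whole story; the five tokens below are invented for illustration, and the __future__ import is only needed on Python 2 (as in the course), being a no-op on Python 3:

```python
from __future__ import division  # Python 2: make / do true division, as in the slides

tokens = ['the', 'wub', 'said', 'the', 'wub']  # invented toy data
n_tokens = len(tokens)
n_types = len(set(tokens))
print(n_tokens / n_types)                      # average uses per distinct word type
print(100 * tokens.count('the') / n_tokens)    # percentage of tokens that are 'the'
```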
Basic text analysis with NLTK
>>> from nltk.text import Text
>>> t = Text(wubWords)
>>> t.concordance('wub')
>>> t.similar('wub')
>>> t.common_contexts(['wub','captain'])
>>> t.dispersion_plot(['wub'])
>>> t.generate()
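Text.concordance prints each occurrence of a word with its surrounding context. The idea can be sketched in plain Python; this is a simplified stand-in, not NLTK's implementation, and concordance_lines is a name of my own choosing:

```python
def concordance_lines(tokens, word, width=3):
    # Collect each occurrence of `word` with `width` tokens of context per side.
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(' '.join(left + [tok] + right))
    return hits

concordance_lines(['I', 'saw', 'the', 'wub', 'today'], 'wub', width=1)
```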
Next time
Q6 take home
Intro to text stats