On-line documents 3
Day 22 - 10/17/14
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University



Page 2: Course organization


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction: http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering:

3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

Page 3: Open Spyder


Page 4: Review: How to download a file from Project Gutenberg


Page 5: The global working directory
2.2.1. How to set the global working directory in Spyder

I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.

Open Spyder. On a Mac, click on the Python menu (top left); in Windows, click on the Tools menu.

Open Preferences > Global working directory > Startup. Under "At startup, the global working directory is:", choose "the following directory:" and enter /Users/harryhow/Documents/pyScripts (or your own pyScripts path).

Set the next two selections to "the global working directory".

Leave the last option untouched and unchecked.
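You can also confirm the setting from the Python console itself. A minimal check with the standard os module (the path in the comment is the one from this slide; substitute your own):

import os
print os.getcwd()    # should end in pyScripts
# set it by hand if it does not:
# os.chdir('/Users/harryhow/Documents/pyScripts')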

Page 6: gutenLoader

def gutenLoader(url, name):
    # Python 2: urlopen lives in the urllib module
    from urllib import urlopen
    download = urlopen(url)
    text = download.read()
    print 'text length = '+str(len(text))
    # skip the Project Gutenberg header: find the START line,
    # then the first newline after it
    lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = text.index('\n',lineIndex)
    # cut the text off at the Project Gutenberg footer
    endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    story = text[startIndex:endIndex]
    print 'story length = '+str(len(story))
    # save the story to a file in the current working directory
    tempFile = open(name,'w')
    tempFile.write(story.encode('utf8'))
    tempFile.close()
    print 'File saved'
    return


Page 7: Usage

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> url = 'http://www.gutenberg.org/cache/epub/32154/pg32154.txt'
>>> name = 'VariableMan.txt'
>>> gutenLoader(url, name)


Page 8: Homework

Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so.

HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).
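For instance, at the Python prompt:

>>> str(1)        # integer to string
'1'
>>> int('1')      # string to integer
1
>>> 'pg' + str(32154)
'pg32154'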


Page 9: Answer

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> base = 28554
>>> for n in [1,2,3]:
...     num = str(base+n)
...     url = 'http://www.gutenberg.org/cache/epub/'+num+'/pg'+num+'.txt'
...     name = 'story'+str(n)+'.txt'
...     gutenLoader(url, name)
...


Page 10: Get pdfminer
You only do this once ~ 2.7.3.2. How to install a package by hand

Point your web browser at https://pypi.python.org/pypi/pdfminer/.

Click on the green button to download the compressed folder.

It downloads to your Downloads folder. Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.

Open the Terminal/Command Prompt (Windows: Start > Search > cmd > Command Prompt):

$> cd {drop file here}
$> python setup.py install
$> pdf2txt.py samples/simple1.pdf
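If the install step complains about permissions, you may need administrator rights; on a Mac, try sudo python setup.py install. The last command converts pdfminer's bundled sample file, which checks that the install worked.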


Page 11: Get a pdf file

To get an idea of the source: http://www.federalreserve.gov/monetarypolicy/fomchistorical2008.htm

Download the file to pyScripts:

# make sure that the current working directory is pyScripts
from urllib import urlopen
url = 'http://www.federalreserve.gov/monetarypolicy/files/FOMC20080130meeting.pdf'
download = urlopen(url)
doc = download.read()
# open in binary mode ('wb') so the PDF bytes are not mangled on Windows
tempFile = open('FOMC20080130.pdf','wb')
tempFile.write(doc)
tempFile.close()


Page 12: Run pdf2text

# make sure that Python is looking at pyScripts
>>> from corpFunctions import pdf2text
>>> text = pdf2text('FOMC20080130.pdf')
>>> len(text)
>>> text[:50]
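We import pdf2text from corpFunctions, but the slides never show its definition. As a rough sketch, a version built on pdfminer's documented Python 2 API could look like the following; this is a guess at the function, not necessarily the one in corpFunctions:

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf2text(path):
    # hypothetical re-creation of corpFunctions.pdf2text
    rsrcmgr = PDFResourceManager()
    output = StringIO()
    device = TextConverter(rsrcmgr, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # read the PDF in binary mode and extract each page's text
    pdfFile = open(path, 'rb')
    for page in PDFPage.get_pages(pdfFile):
        interpreter.process_page(page)
    pdfFile.close()
    device.close()
    text = output.getvalue()
    output.close()
    return text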


Page 13: 7.3.2. How to pre-process a text with the PlaintextCorpusReader


Page 14: NLTK

One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats.

Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.


Page 15: PlaintextCorpusReader

The PlaintextCorpusReader needs to know two things: where your file is and what its name is.

If the current working directory is where the file is, the location argument can be left ‘blank’ by using the null string ''.

We only have one file, 'Wub.txt'. It will also prevent problems down the line to give the reader an optional third argument that states the file's encoding, encoding='utf-8'.

Now let NLTK tokenize the text into words and punctuation.


Page 16: Usage

# make sure that the current working directory is pyScripts
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
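Besides words(), the reader exposes other views of the same file; these are standard PlaintextCorpusReader methods:

>>> wubReader.raw()[:100]     # the text as a single string
>>> wubReader.sents()[:5]     # the text tokenized into sentences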


Page 17: Initial look at the text

>>> len(wubWords)
>>> wubWords[:50]
>>> set(wubWords)
>>> len(set(wubWords))
>>> wubWords.count('wub')
>>> len(wubWords) / len(set(wubWords))
>>> from __future__ import division
>>> 100 * wubWords.count('a') / len(wubWords)
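Note that in Python 2, dividing one integer by another with / rounds down, so the first ratio above is truncated; the from __future__ import division line switches / to true division, which is why the percentage is only computed afterwards.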


Page 18: Basic text analysis with NLTK

>>> from nltk.text import Text
>>> t = Text(wubWords)
>>> t.concordance('wub')
>>> t.similar('wub')
>>> t.common_contexts(['wub','captain'])
>>> t.dispersion_plot(['wub'])
>>> t.generate()
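Two caveats: dispersion_plot() draws a graph and so requires the matplotlib package to be installed, and generate() prints random text modeled on the source (a trigram model in the NLTK 2.x used in this course).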


Page 19: Next time

Q6 take home
Intro to text stats
