On-line documents 3
Day 22 - 10/17/14
LING 3820 & 6820 Natural Language Processing
Harry Howard
Tulane University



Page 2: Course organization


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction: http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering:

3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

Page 3: Open Spyder


Page 4: Review: How to download a file from Project Gutenberg


Page 5: The global working directory
2.2.1. How to set the global working directory in Spyder

I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.

Open Spyder. On a Mac, click on the Python menu (top left); in Windows, click on the Tools menu.

Open Preferences > Global working directory > Startup. Under "At startup, the global working directory is:", choose "the following directory:" and enter /Users/harryhow/Documents/pyScripts (or your own pyScripts path).

Set the next two selections to "the global working directory".

Leave the last option untouched and unchecked.
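You can also confirm the setting from the Python console itself. A minimal check with the standard os module (the path in the comment is the one from this slide; substitute your own):

import os
print os.getcwd()    # should end in pyScripts
# set it by hand if it does not:
# os.chdir('/Users/harryhow/Documents/pyScripts')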

Page 6: gutenLoader

def gutenLoader(url, name):
    # Python 2: urlopen lives in the urllib module
    from urllib import urlopen
    download = urlopen(url)
    text = download.read()
    print 'text length = '+str(len(text))
    # skip the Project Gutenberg header: find the START line,
    # then the first newline after it
    lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = text.index('\n',lineIndex)
    # cut the text off at the Project Gutenberg footer
    endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    story = text[startIndex:endIndex]
    print 'story length = '+str(len(story))
    # save the story to a file in the current working directory
    tempFile = open(name,'w')
    tempFile.write(story.encode('utf8'))
    tempFile.close()
    print 'File saved'
    return


Page 7: Usage

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> url = 'http://www.gutenberg.org/cache/epub/32154/pg32154.txt'
>>> name = 'VariableMan.txt'
>>> gutenLoader(url, name)


Page 8: Homework

Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so.

HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).
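For instance, at the Python prompt:

>>> str(1)        # integer to string
'1'
>>> int('1')      # string to integer
1
>>> 'pg' + str(32154)
'pg32154'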


Page 9: Answer

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> base = 28554
>>> for n in [1,2,3]:
...     num = str(base+n)
...     url = 'http://www.gutenberg.org/cache/epub/'+num+'/pg'+num+'.txt'
...     name = 'story'+str(n)+'.txt'
...     gutenLoader(url, name)
...


Page 10: Get pdfminer
You only do this once ~ 2.7.3.2. How to install a package by hand

Point your web browser at https://pypi.python.org/pypi/pdfminer/.

Click on the green button to download the compressed folder.

It downloads to your Downloads folder. Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.

Open the Terminal/Command Prompt (Windows: Start > Search > cmd > Command Prompt):

$> cd {drop file here}
$> python setup.py install
$> pdf2txt.py samples/simple1.pdf
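If the install step complains about permissions, you may need administrator rights; on a Mac, try sudo python setup.py install. The last command converts pdfminer's bundled sample file, which checks that the install worked.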


Page 11: Get a pdf file

To get an idea of the source: http://www.federalreserve.gov/monetarypolicy/fomchistorical2008.htm

Download the file to pyScripts:

# make sure that the current working directory is pyScripts
from urllib import urlopen
url = 'http://www.federalreserve.gov/monetarypolicy/files/FOMC20080130meeting.pdf'
download = urlopen(url)
doc = download.read()
# open in binary mode ('wb') so the PDF bytes are not mangled on Windows
tempFile = open('FOMC20080130.pdf','wb')
tempFile.write(doc)
tempFile.close()


Page 12: Run pdf2text

# make sure that Python is looking at pyScripts
>>> from corpFunctions import pdf2text
>>> text = pdf2text('FOMC20080130.pdf')
>>> len(text)
>>> text[:50]
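We import pdf2text from corpFunctions, but the slides never show its definition. As a rough sketch, a version built on pdfminer's documented Python 2 API could look like the following; this is a guess at the function, not necessarily the one in corpFunctions:

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf2text(path):
    # hypothetical re-creation of corpFunctions.pdf2text
    rsrcmgr = PDFResourceManager()
    output = StringIO()
    device = TextConverter(rsrcmgr, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # read the PDF in binary mode and extract each page's text
    pdfFile = open(path, 'rb')
    for page in PDFPage.get_pages(pdfFile):
        interpreter.process_page(page)
    pdfFile.close()
    device.close()
    text = output.getvalue()
    output.close()
    return text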


Page 13: 7.3.2. How to pre-process a text with the PlaintextCorpusReader


Page 14: NLTK

One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats.

Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.


Page 15: PlaintextCorpusReader

The PlaintextCorpusReader needs to know two things: where your file is and what its name is.

If the current working directory is where the file is, the location argument can be left ‘blank’ by using the null string ''.

We only have one file, 'Wub.txt'. It will also prevent problems down the line to give the reader an optional third argument that states the file's encoding, encoding='utf-8'.

Now let NLTK tokenize the text into words and punctuation.


Page 16: Usage

# make sure that the current working directory is pyScripts
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
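Besides words(), the reader exposes other views of the same file; these are standard PlaintextCorpusReader methods:

>>> wubReader.raw()[:100]     # the text as a single string
>>> wubReader.sents()[:5]     # the text tokenized into sentences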


Page 17: Initial look at the text

>>> len(wubWords)
>>> wubWords[:50]
>>> set(wubWords)
>>> len(set(wubWords))
>>> wubWords.count('wub')
>>> len(wubWords) / len(set(wubWords))
>>> from __future__ import division
>>> 100 * wubWords.count('a') / len(wubWords)
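Note that in Python 2, dividing one integer by another with / rounds down, so the first ratio above is truncated; the from __future__ import division line switches / to true division, which is why the percentage is only computed afterwards.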


Page 18: Basic text analysis with NLTK

>>> from nltk.text import Text
>>> t = Text(wubWords)
>>> t.concordance('wub')
>>> t.similar('wub')
>>> t.common_contexts(['wub','captain'])
>>> t.dispersion_plot(['wub'])
>>> t.generate()
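Two caveats: dispersion_plot() draws a graph and so requires the matplotlib package to be installed, and generate() prints random text modeled on the source (a trigram model in the NLTK 2.x used in this course).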


Page 19: Next time

Q6 take home
Intro to text stats
