On-line documents Day 20 - 10/13/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Course organization

http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

13-Oct-2014, NLP, Prof. Howard, Tulane University

Basic text analysis

The NLTK archive

Open Spyder

7. Corpora of digital texts

Now that you have gotten a taste of Python, let us turn to the main course, textual computing or the computational analysis of text. But we do not have a text to work with yet, so let’s go and find one.

7.1. How to get a text from an on-line archive

The first step is to figure out where to put the file.

7.1.1. How to navigate folders with os

>>> import os
>>> os.getcwd()
'/Applications/IDEs/Spyder.app/Contents/Resources'
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts/')
>>> os.getcwd()
'/Users/{your_user_name}/Documents/pyScripts'
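
The same navigation can be tried without touching your real folders. The following is a minimal sketch, assuming Python 3; the temporary folder is a hypothetical stand-in for pyScripts and is deleted when the block finishes.

```python
import os
import tempfile

# Remember where Python starts out so we can return there afterwards.
original_dir = os.getcwd()

# A throwaway folder stands in for pyScripts (hypothetical stand-in).
with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)
    # realpath resolves symbolic links before the two paths are compared.
    moved = os.path.realpath(os.getcwd()) == os.path.realpath(scratch)
    print(moved)
    # Leave the folder before it is deleted.
    os.chdir(original_dir)
```

Comparing the paths after os.path.realpath() avoids a false mismatch on systems where the temporary folder sits behind a symbolic link.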

7.1.2. Project Gutenberg

http://www.gutenberg.org/ebooks/28554

7.1.3. How to download a file with urllib and convert it to a string with read()

>>> from urllib import urlopen # Python 2; in Python 3: from urllib.request import urlopen
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = urlopen(url)
>>> downloadString = download.read()
>>> type(downloadString)
>>> len(downloadString) # 35739?
>>> downloadString[:50]
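
The slide uses Python 2's urllib. Under Python 3, read() returns bytes rather than a string, so a decoding step is needed. The following is a minimal sketch under that assumption; fetch_text is a hypothetical helper name, UTF-8 is an assumed encoding, and a local file:// URL stands in for the Project Gutenberg address so that the block runs offline.

```python
import pathlib
import tempfile
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its contents decoded to a str."""
    # urlopen returns a file-like response; read() yields bytes in Python 3.
    with urlopen(url) as response:
        raw = response.read()
    # Decode explicitly; UTF-8 is an assumption about the file.
    return raw.decode(encoding)


# Offline demonstration: urlopen also accepts file:// URLs, so a local
# sample file stands in for the on-line archive.
with tempfile.TemporaryDirectory() as scratch:
    sample = pathlib.Path(scratch) / "sample.txt"
    sample.write_text("A short sample text.", encoding="utf-8")
    downloadString = fetch_text(sample.as_uri())

print(type(downloadString), len(downloadString))
```

With a network connection, the same helper could be called as fetch_text(url) with the url defined on the slide.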

7.1.4. How to save a file to your drive with open(), write(), and close()

# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(downloadString.encode('utf8'))
>>> tempFile.close()
# import os if you haven't already done so
>>> os.listdir('.')
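
In Python 3 the encoding can be given to open() itself, and a with-statement takes the place of the explicit close(). The following is a minimal sketch under those assumptions; the sample string and the temporary folder are hypothetical stand-ins for the downloaded text and for pyScripts.

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded string.
sample_text = "Sample text standing in for the download.\n"

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "Wub.txt")
    # The with-statement closes the file even if write() raises an error.
    with open(path, "w", encoding="utf-8") as temp_file:
        temp_file.write(sample_text)
    # Confirm that the file now exists in the folder.
    listing = os.listdir(scratch)

print(listing)
```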

7.1.5. How to look at a file with open() and read()

>>> tempFile = open('Wub.txt','r')
>>> text = tempFile.read()
>>> type(text)
>>> len(text)
>>> text[:50]
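
Reading can be wrapped the same way. The following is a minimal sketch, assuming Python 3 and a UTF-8 file; read_text is a hypothetical helper name, and the block first writes a small sample file so that it has something to read.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole contents of the file at path as one str."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "Wub.txt")
    # Write a small sample first (hypothetical content).
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("A sample line of text.")
    text = read_text(path)

# The same three inspections as on the slide.
print(type(text), len(text), text[:50])
```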

7.1.6. How to slice away what you don’t need

>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
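
To see what each index picks out, the same steps can be run on a short made-up string. The following is a minimal sketch; the sample text is hypothetical and only imitates the layout of the marker lines.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"

# Hypothetical miniature file: header, marker line, story, marker line, footer.
sample = (
    "Header notes\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The story itself.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer notes\n"
)

# Where the start marker begins.
line_index = sample.index(START_MARKER)
# The first line break after it, so the rest of the marker line is skipped.
start_index = sample.index("\n", line_index)
# Where the end marker begins; the slice stops just before it.
end_index = sample.index(END_MARKER)

story = sample[start_index:end_index]
print(repr(story))
```

Note that index() raises a ValueError when a marker is missing, so a file whose marker lines are worded differently will stop the script rather than return an empty story.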

Now save it as “Wub.txt”

# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()

Homework

Turn the commands reviewed above into a function in a script that takes a URL and the name of a text file as arguments and saves the Project Gutenberg file to your pyScripts folder without the Project Gutenberg header and footer.

Next time

How to use PDF files
