Web text
Day 34 - 11/14/14
LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

14-Nov-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering

3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

Open Spyder


Twitter

Review



Finding text on the web



http://sethgodin.typepad.com/



Firefox: Tools > web developer > Page source
Safari: Prefs > Advanced > Show develop >> show page source

<div class="entry-body"> <p>If someone asked you how to do something …. By all means, you still need pictures, even video. But there's nothing to replace the specificity that comes from the alphabet. Use labels. Use words.</p> </div>
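To make concrete what "finding the entry-body div" in the page source involves, here is a minimal sketch that uses only Python's standard library. It is not the course's method (the slides use BeautifulSoup), it assumes Python 3, and the class name EntryBodyExtractor and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class EntryBodyExtractor(HTMLParser):
    """Collect the text inside every <div class="entry-body"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside an entry-body div (counts nested divs)
        self.chunks = []    # text pieces found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or "entry-body" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())


# Hypothetical sample markup, shaped like the page source above.
sample_html = """
<div class="sidebar"><p>Not part of the post.</p></div>
<div class="entry-body"> <p>Use labels. Use words.</p> </div>
"""

parser = EntryBodyExtractor()
parser.feed(sample_html)
parser.close()
entry_text = parser.text()
print(entry_text)
```

The class attribute is what separates the post from the rest of the page, which is why the browser's page source view is worth checking before writing any code.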



We need

requests
% pip install feedparser
% pip install BeautifulSoup4
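One way to check whether the three packages are already available before installing anything is sketched below. It assumes Python 3.4 or later (for importlib.util.find_spec); the helper name missing_packages and the REQUIRED table are hypothetical. Note that BeautifulSoup4 is imported under the name bs4.

```python
import importlib.util

# pip package name -> name used in the import statement
REQUIRED = {
    "requests": "requests",
    "feedparser": "feedparser",
    "BeautifulSoup4": "bs4",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Still to install: " + ", ".join(missing))
else:
    print("All three packages are available.")
```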



Get the text

import requests
from bs4 import BeautifulSoup

# Download the blog's front page as text.
url = 'http://sethgodin.typepad.com/'
html = requests.get(url).text

# Parse the HTML and print the first <div class="entry-body">, encoded as UTF-8.
soup = BeautifulSoup(html)
print soup.find("div", {"class":"entry-body"}).text.encode('utf8')



Install feedparser by hand

https://pypi.python.org/pypi/feedparser
click on Downloads button
choose .zip file

$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install



Get the RSS feed

from bs4 import BeautifulSoup
import feedparser

# Download and parse the blog's RSS feed.
url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

# Keep the title, the plain-text content and the link of every entry.
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value).get_text().encode('utf8'),
                       'link': e.links[0].href})

# Show the text of the most recent post.
print blog_posts[0]['content']
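The loop above turns a feed into a list of dictionaries, one per post. As a rough illustration of that data structure, the sketch below builds the same kind of list from a small RSS 2.0 string using only the standard library. This is a swapped-in technique, not what feedparser does internally (feedparser also handles Atom and many malformed feeds). It assumes Python 3; the sample feed and the function name posts_from_rss are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item RSS 2.0 feed, standing in for the downloaded one.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Use labels.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Use words.</description>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Return (feed title, list of {'title', 'content', 'link'} dicts)."""
    channel = ET.fromstring(xml_text).find("channel")
    posts = []
    for item in channel.findall("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "content": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return channel.findtext("title", default=""), posts


feed_title, blog_posts = posts_from_rss(SAMPLE_RSS)
print("Fetched %s entries from '%s'" % (len(blog_posts), feed_title))
print(blog_posts[0]["content"])
```

Each post is then reachable by position and key, for example blog_posts[1]["link"] for the link of the second post.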



Next time

something else
maybe a quiz

