Web text Day 34 - 11/14/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University



Web text
Day 34 - 11/14/14
LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

14-Nov-2014 NLP, Prof. Howard, Tulane University

http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering

3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control


Open Spyder


Twitter

Review


Finding text on the web


http://sethgodin.typepad.com/


Firefox: Tools > web developer > Page source
Safari: Prefs > Advanced > Show develop >> show page source

<div class="entry-body"> <p>If someone asked you how to do something …. By all means, you still need pictures, even video. But there&#39;s nothing to replace the specificity that comes from the alphabet. Use labels. Use words.</p> </div><!-- .entry-body -->
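The markup above suggests what to look for: the post text sits inside a div whose class is entry-body, and the apostrophe is stored as the character reference &#39;. As a rough sketch of that idea using only Python's standard library (the class EntryBodyText and the function entry_body_text are hypothetical names made up for this illustration, and the sample string is a shortened copy of the source shown above):

```python
from html.parser import HTMLParser

class EntryBodyText(HTMLParser):
    """Collect the text found inside <div class="entry-body"> elements."""

    def __init__(self):
        # convert_charrefs is True by default, so &#39; arrives as '
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside an entry-body
        self.chunks = []    # pieces of text collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1  # a nested div inside the entry body
            elif "entry-body" in (dict(attrs).get("class") or "").split():
                self.depth = 1   # entering the entry body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def entry_body_text(html):
    """Return the text of all entry-body divs in html, stripped of tags."""
    parser = EntryBodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

sample = ('<div class="entry-body"> <p>But there&#39;s nothing to replace '
          'the specificity that comes from the alphabet. Use labels. '
          'Use words.</p> </div><!-- .entry-body -->')
print(entry_body_text(sample))
```

The tags and the trailing comment are dropped, and the character reference comes back as an ordinary apostrophe.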


We need

requests
% pip install feedparser
% pip install BeautifulSoup4
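One way to check whether the three packages are already available is to ask Python whether it can locate them. This is only a sketch; missing_modules is a hypothetical helper name. Note that the BeautifulSoup4 package is imported under the name bs4.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means all three can be imported.
print(missing_modules(["requests", "feedparser", "bs4"]))
```

Any name that shows up in the printed list still needs to be installed.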


Get the text

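import requests
from bs4 import BeautifulSoup

url = 'http://sethgodin.typepad.com/'
html = requests.get(url).text              # download the page source as text
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# print the text of the first <div class="entry-body"> on the page
print(soup.find("div", {"class": "entry-body"}).text)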
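A page usually holds several posts, and find returns only the first match (or None when nothing matches, in which case .text raises an AttributeError). The sketch below uses find_all instead and runs on a small made-up HTML string rather than the live page; entry_texts is a hypothetical helper name, and the sample markup is an assumption modeled on the entry-body div.

```python
from bs4 import BeautifulSoup

def entry_texts(html):
    """Return the stripped text of every <div class="entry-body"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns an empty list when nothing matches, so no None check is needed
    return [div.get_text().strip()
            for div in soup.find_all("div", class_="entry-body")]

# Made-up markup standing in for a downloaded page
sample = """
<div class="entry-body"><p>Use labels. Use words.</p></div>
<div class="sidebar"><p>Not a post.</p></div>
<div class="entry-body"><p>A second post.</p></div>
"""

for text in entry_texts(sample):
    print(text)
```

To try it on a real page, pass the html string obtained from requests.get(url).text.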


Install feedparser by hand

https://pypi.python.org/pypi/feedparser
click on Downloads button
choose .zip file
$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install


Get the RSS feed

from bs4 import BeautifulSoup
import feedparser

url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)  # download and parse the feed
print("Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title))

blog_posts = []
for e in fp.entries:
    # the entry content is HTML, so strip the tags with BeautifulSoup
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text(),
                       'link': e.links[0].href})

print(blog_posts[0]['content'])
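To see what kind of data sits behind a feed, here is a rough offline sketch that reads a tiny, made-up RSS 2.0 document with the standard library's xml.etree.ElementTree. It is a stand-in for illustration, not what feedparser does internally; SAMPLE_RSS, read_items, and the example.com links are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document with two entries
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Use labels. Use words.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Another entry.</description>
    </item>
  </channel>
</rss>"""

def read_items(rss_text):
    """Return (feed title, list of post dicts) from an RSS 2.0 string."""
    channel = ET.fromstring(rss_text).find("channel")
    posts = []
    for item in channel.findall("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "content": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return channel.findtext("title", default=""), posts

feed_title, blog_posts = read_items(SAMPLE_RSS)
print("Fetched %s entries from '%s'" % (len(blog_posts), feed_title))
print(blog_posts[0]["content"])
```

Real feeds vary (Atom versus RSS, content versus description, HTML inside the text), which is the variation a library like feedparser smooths over.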


Next time

something else
maybe a quiz
