-
CIS 192: Lecture 8
HTML Parsing
Lili Dworkin
University of Pennsylvania
-
HTTP Requests
Use the requests library to make HTTP requests:
>>> import requests
>>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"
>>> req = requests.get(url)
>>> req
<Response [200]>
-
HTTP Requests
The response object has lots of useful attributes and methods:
• url
• text
• headers
• cookies
• status_code
• json()
-
HTTP Requests
Get the HTML source:
>>> source = req.text
>>> source[:99]
u'\n\n \n <meta charset=''utf-8''>\n CIS192'
>>> print source[:99]
CIS192
-
Unicode
The object returned has type unicode, not str:
>>> type(req.text)
<type 'unicode'>
>>> unicode('hello')
u'hello'
>>> str(unicode('hello'))
'hello'
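The session above is Python 2, where text is unicode and raw bytes are str. As a rough sketch of the same distinction in Python 3 (an assumption, since the lecture itself uses Python 2), text is str, raw bytes are bytes, and converting between them is explicit. The sample string is made up for illustration.

```python
# Python 3 sketch (assumption: the slides themselves use Python 2).
# Text lives in str; raw bytes live in bytes; conversion is explicit.
text = u"caf\u00e9"                 # a 4-character text string
raw = text.encode("utf-8")          # text -> bytes, as sent over the network
round_trip = raw.decode("utf-8")    # bytes -> text again

print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))
print(round_trip == text)
```

The byte string is longer than the text because the accented character takes two bytes in UTF-8.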
-
Status Codes
>>> url = "http://www.cis.upenn.edu/~cis192/spring2014/"
>>> req = requests.get(url)
>>> req.status_code
200
-
Status Codes
• 2xx – success
• 3xx – redirection
• 4xx – client error
• 5xx – server error
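The leading digit of the code decides the category. A minimal sketch of that mapping, written for Python 3; the helper name status_class is hypothetical and not part of requests.

```python
def status_class(code):
    """Map an HTTP status code to the category named on the slide."""
    categories = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division keeps only the leading digit of a three-digit code.
    return categories.get(code // 100, "other")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```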
-
Status Codes
A “bad” status code won’t throw an error in your code, so you run the risk of thinking things worked when they actually didn’t:
>>> url = 'http://httpbin.org/hidden-basic-auth'
>>> req = requests.get(url)
>>> # try to do stuff with req.text
>>> # doesn't work! because:
>>> req.status_code
404
-
Status Codes
Either check for the error directly, or use raise_for_status():
>>> url = 'http://httpbin.org/hidden-basic-auth'
>>> req = requests.get(url)
>>> if req.status_code != 200:
...     raise Exception()
OR
>>> url = 'http://httpbin.org/hidden-basic-auth'
>>> req = requests.get(url)
>>> req.raise_for_status()
Traceback (most recent call last):
...
requests.exceptions.HTTPError: 404 Client Error
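The two checks are not equivalent: comparing against 200 rejects every other code, including other 2xx successes, while raise_for_status() raises only for 4xx and 5xx codes. The sketch below mirrors that second rule without touching the network. FakeResponse and ensure_ok are hypothetical names used only for illustration; they are not part of requests.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a response; only status_code is modeled."""

    def __init__(self, status_code):
        self.status_code = status_code


def ensure_ok(response):
    """Raise for 4xx/5xx codes, otherwise return the response unchanged."""
    if 400 <= response.status_code < 600:
        raise RuntimeError("bad status: %d" % response.status_code)
    return response


ensure_ok(FakeResponse(204))        # passes: 204 is a success code
try:
    ensure_ok(FakeResponse(404))
except RuntimeError as err:
    print(err)                      # bad status: 404
```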
-
Requests with Parameters
• requests.get() issues a GET request
• GET requests can take parameters
• The query string is sent in the URL, e.g.:
>>> req = requests.get('.../test?name1=value1&name2=value2')
• But you shouldn’t do the formatting yourself:
>>> params = {'name1': 'value1', 'name2': 'value2'}
>>> req = requests.get('.../test', params=params)
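One reason not to format the query string by hand is that values containing spaces or "&" need escaping. The sketch below uses the standard library's urlencode (not requests itself) to show the kind of escaping involved. The parameter values and the http://httpbin.org/get endpoint are assumptions chosen for illustration.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# A list of pairs keeps the parameter order fixed for the example.
params = [("name1", "value1"), ("q", "tom & jerry")]
query = urlencode(params)
print(query)

# Assumed endpoint, used only to show where the query string goes.
url = "http://httpbin.org/get" + "?" + query
print(url)
```

Pasting "tom & jerry" into the URL unescaped would make the server see a second, empty-looking parameter after the "&".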
-
httpbin
Let’s practice a bit on httpbin.org, an HTTP request and response test service.
-
HTML Structure
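<html><head><title>The Dormouse Story</title></head>
<body><p class="title"><b>The Dormouse Story</b></p>
<p ...><a ...>Elsie</a>
<a ...>Lacie</a></p></body></html>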
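To make the nesting of tags in a document like this visible, here is a small sketch using the standard library's html.parser (swapped in for illustration; it is not BeautifulSoup). The markup in html_doc is an assumed version of the example: the class "story", the ids, and the href values are made up to be consistent with the later slides.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

# Assumed markup: attribute values such as the href targets are made up.
html_doc = (
    '<html><head><title>The Dormouse Story</title></head>'
    '<body><p class="title"><b>The Dormouse Story</b></p>'
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p></body></html>'
)


class OutlinePrinter(HTMLParser):
    """Collect one indented line per opening tag to show the nesting."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(html_doc)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one step down the tree, which is the structure the navigation slides walk through.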
-
BeautifulSoup
Use BeautifulSoup to parse HTML:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(req.text) # html source
>>> soup = BeautifulSoup(html_doc) # html string
>>> type(soup)
<class 'bs4.BeautifulSoup'>
-
Tag Objects
A Tag object corresponds to an HTML tag:
>>> soup.p
<p class="title"><b>The Dormouse Story</b></p>
>>> type(soup.p)
<class 'bs4.element.Tag'>
Tags have names and attributes:
>>> tag = soup.p
>>> tag.name
u'p'
>>> tag.attrs
{u'class': [u'title']}
>>> tag['class']
[u'title']
-
Pretty Printing
Use prettify() to pretty-print a tag (you can also call it on the entire soup object, but the output can get long):
>>> print soup.p.prettify()
<p class="title">
 <b>
  The Dormouse Story
 </b>
</p>
-
Navigating
Say the name of the tag you want:
>>> soup.head
<head><title>The Dormouse Story</title></head>
Zooming in:
>>> soup.head.title
<title>The Dormouse Story</title>
Using a tag name as an attribute gets the first tag by that name:
>>> soup.a
<a ...>Elsie</a>
-
Navigating: Going Down
Getting a tag’s children:
>>> soup.p
<p class="title"><b>The Dormouse Story</b></p>
>>> soup.p.contents
[<b>The Dormouse Story</b>]
>>> soup.p.children
<listiterator object at 0x...>
>>> [i for i in soup.p.children]
[<b>The Dormouse Story</b>]
-
Navigating: Going Down
Note that the child might also have a child ...
>>> soup.p
<p class="title"><b>The Dormouse Story</b></p>
>>> soup.p.contents
[<b>The Dormouse Story</b>]
>>> child = soup.p.contents[0]
>>> child.contents
[u'The Dormouse Story']
-
Navigating: Going Down
To search recursively and get children of children of children, etc., use descendants:
>>> for i in soup.p.descendants:
...     print i
<b>The Dormouse Story</b>
The Dormouse Story
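The difference between direct children and descendants is a depth-first walk of the tree. The sketch below shows that walk on a hypothetical Node class written just for this example; it is not how BeautifulSoup is implemented.

```python
class Node(object):
    """Hypothetical tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def descendants(self):
        """Yield children, grandchildren, and so on, depth first."""
        for child in self.children:
            yield child
            for grandchild in child.descendants():
                yield grandchild


# Mirrors <p><b>text</b></p>: p contains b, and b contains the text.
text = Node("text")
b = Node("b", [text])
p = Node("p", [b])

print([n.name for n in p.children])        # direct children only
print([n.name for n in p.descendants()])   # every level below p
```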
-
Navigating: Going Up
Getting a tag’s parent:
>>> soup.title
<title>The Dormouse Story</title>
>>> soup.title.parent
<head><title>The Dormouse Story</title></head>
-
Navigating: Going Sideways
Getting a tag’s siblings:
>>> soup.a
<a ...>Elsie</a>
>>> soup.a.next_sibling
<a ...>Lacie</a>
-
NavigableString Objects
A NavigableString object corresponds to a bit of text within a tag:
>>> soup.p
<p class="title"><b>The Dormouse Story</b></p>
>>> soup.p.string
u'The Dormouse Story'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
-
NavigableString Objects
A NavigableString does a lot more than a regular string:
>>> soup.p.string.parent
<b>The Dormouse Story</b>
So if all you need is the text, you should convert:
>>> str(soup.p.string)
'The Dormouse Story'
-
NavigableString Objects
Sometimes it’s helpful to check whether what you’re dealing with is a tag or a NavigableString:
>>> from bs4 import BeautifulSoup, NavigableString
>>> [i for i in soup.descendants
...  if isinstance(i, NavigableString)]
[u'The Dormouse Story', u'The Dormouse Story', u'Elsie', u'Lacie']
-
NavigableString Objects
Another option: use .text:
>>> soup.p
<p class="title"><b>The Dormouse Story</b></p>
>>> soup.p.text
u'The Dormouse Story'
>>> type(soup.p.text)
<type 'unicode'>
This actually behaves differently from .string:
>>> soup.string
>>>
>>> soup.text
u'The Dormouse StoryThe Dormouse StoryElsieLacie'
-
Searching via Strings
Find all <p> tags:
>>> soup.find_all('p')
[<p class="title"><b>The Dormouse Story</b></p>,
 <p ...><a ...>Elsie</a><a ...>Lacie</a></p>]
-
Searching via Regular Expressions
Find all tags whose names start with “b”:
>>> import re
>>> soup.find_all(re.compile("^b"))
[<body><p class="title"><b>The Dormouse Story</b></p><p ...><a ...>Elsie</a><a ...>Lacie</a></p></body>,
 <b>The Dormouse Story</b>]
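The pattern is tested against tag names, which is why both body and b come back. A minimal sketch of that filtering step using only the re module; the list of tag names is taken from the example document.

```python
import re

# Tag names that occur in the example document.
tag_names = ["html", "head", "title", "body", "p", "b", "a"]
pattern = re.compile("^b")

# "^" anchors the pattern at the start of the name, so a "b" that
# appears later in a name would not count.
matching = [name for name in tag_names if pattern.search(name)]
print(matching)
```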
-
Searching via Lists
Find all <a> or <b> tags:
>>> soup.find_all(['a', 'b'])
[<b>The Dormouse Story</b>, <a ...>Elsie</a>, <a ...>Lacie</a>]
-
Searching via Functions
Find all tags that have “class” but not “id” attributes:
>>> def has_class_but_no_id(tag):
...     return 'class' in tag.attrs and \
...         'id' not in tag.attrs
...
>>> soup.find_all(has_class_but_no_id)
[<p class="title"><b>The Dormouse Story</b></p>,
 <p ...><a ...>Elsie</a><a ...>Lacie</a></p>]
-
Searching via Keyword Argument
>>> soup.find_all(id="link2")
[<a ...>Lacie</a>]
>>> soup.find_all(href=re.compile("^http(.)*elsie"))
[<a ...>Elsie</a>]
• You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.
• You can filter multiple attributes at once by passing in more than one keyword argument.
-
Searching via CSS Class
“class” is a reserved word in Python!
>>> soup.find_all(class="sister")
SyntaxError
>>> soup.find_all(class_="sister")
[<a ...>Elsie</a>, <a ...>Lacie</a>]
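The SyntaxError comes from Python's parser, before BeautifulSoup is ever involved. A short sketch using the standard keyword module to check which names are reserved; the call is compiled rather than run, so no soup object is needed.

```python
import keyword

# "class" is reserved, so it cannot be used as a keyword-argument name;
# "class_" is an ordinary identifier.
print(keyword.iskeyword("class"))
print(keyword.iskeyword("class_"))

# Compiling the call shows the failure without needing a soup object.
try:
    compile('find_all(class="sister")', "<example>", "eval")
    compiled_ok = True
except SyntaxError:
    compiled_ok = False
print(compiled_ok)
```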
-
Amazon Reviews
Let’s get a list of Amazon reviews for Dive Into Python.
-
Quiz
1. Consider the code l = (i for i in range(10))
   • What happens if I type l[0]?
   • What about l.next()?
   • What about sum(l)?
   • What about len(l)?
2. Write code to return a list containing all pairs of items from the list ['a', 'b', 'c', 'd'].
3. Write a regular expression to match a date of the form MM-DD-YYYY or MM-DD-YY. Don’t worry about making sure that the numbers are “valid.”
4. If I call a Python program with python test.py --name Lili www.google.com -n 10 and parse the command-line arguments with options, args = optparser.parse_args(), what will the list args contain?
5. **