Scrape the Web: Strategies for programming websites that don’t expect it

Presenter: Asheesh Laroia, @asheeshlaroia ([email protected], +1-585-506-8865)

February 18, 2010

Outline

Intro

Programming the web

Stats pop quiz

The web: Round one

The web: HTTP and you

Recap and philosophy

Parser redux

Countermeasures

Automating the web browser

Other tricks

Conclusions

Intro

Meta

Hello

- You will learn neat tricks

- DO NOT BECOME AN EVIL COMMENT SPAMMER

- Theory, practice, and iterative development

- Brittle? Sometimes.

- The comics aren’t mine; ask me for references.

Format introduction

- I’ll stand up here and talk about things.

- You’ll ask me questions.

You know what sucks?

- It sucks when everyone’s thinking something and nobody’s saying it.

- If I am incoherent, stop me.

“Only” three hours

- Slow me down,

- or speed me up.

- Do this with your voice or by raising your hand.

- Don’t try to do it via Twitter.

What is screen scraping?

Brittle?

Remote procedure call

- Every time you press a key, you cause the remote computer to execute code.

- Every keypress causes a remote procedure call.

- If you understand this, you can document it as an API.

Power

- We get to interact with the raw data.

- We could write our own interface.

- We get to programmatically interact with a system that only expects humans at the door.

Independence

- Design choices and restrictions fall away.

Power, too much

- WE CAN SEND SPAM!

- Don’t do that.

Programming the web

Say

The Web

- It’s the twenty-first century.

- The Web is a massive, mostly-unrestricted remote procedure call system.

Mac OS “say”

- I’m not hip enough to have “say”

- but I do have the Web

Cepstral demo

Curry

Delicious

Curry on the web

http://mehfilindian.com/LunchMenuTakeOut.htm

Beneath the covers...

- FrontPage 6.0 is from 2003

- Some really ugly HTML...

- I like to call this 1998-style HTML

The easy way

examples/curry/trivial.py

- urllib2.urlopen() gives you a file-like object

- Now you can read() it... (and you get a big ol’ byte string)

- Test its contents for squash, and you’re done.
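
The steps above can be sketched roughly as follows. This is not the contents of examples/curry/trivial.py; it is a minimal illustration that assumes Python 3, where `urllib.request.urlopen()` plays the role that `urllib2.urlopen()` plays on the slide. The helper names `fetch` and `mentions_squash` are made up for this sketch.

```python
import urllib.request

# Menu URL as given on the "Curry on the web" slide.
MENU_URL = "http://mehfilindian.com/LunchMenuTakeOut.htm"


def fetch(url):
    """Return the raw bytes of a page. This performs a network request."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def mentions_squash(page_bytes):
    """Case-insensitive substring test on the raw page bytes."""
    return b"squash" in page_bytes.lower()
```

Calling `mentions_squash(fetch(MENU_URL))` would check the live menu; the substring test itself needs no network and no parser at all.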

The Web and standards

- We don’t have to resort to visual screen scraping.

- The web has a standard data format for marking up page content.

- What is it called?

XHTML and HTML

- It’s 2010.

- Surely XHTML has won by now.

“Extract some information”

- HTML

- vs. XHTML (2000)

- Both are trees of tags; both can be visualized in FireBug.

- ...did XHTML win?

Stats pop quiz

(Stats from the MAMA survey published by Opera <http://dev.opera.com/articles/view/mama-key-findings/>.)

- Average page size?
  - 16.5K

- HTML to XHTML ratio?
  - 2:1

- Transitional vs. Strict/Frameset:
  - 10:1

- How many in “Quirks” mode?
  - 85%

- What’s more popular? TITLE or BODY?
  - TITLE

- What percent validate in general?
  - ca. 4.13%

- What percent of web pages that have validation badges validate?
  - ca. 1/2

The web: Round one

Parsing considerations

A showcase of some of your options

- An example of valid HTML (written by hand) (examples/parsing/)
  - Parsed with HTMLParser

- An example of invalid HTML (cooked by hand) (examples/parsing/invalid-html/)

- An example of valid XHTML (written by hand) (examples/parsing/valid-xhtml/)
  - Parsed with xml.dom.minidom
  - Parsed with HTMLParser

- An example of invalid XHTML <http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm> (examples/parsing/invalid-xhtml/)
  - In Firefox
  - In xml.dom.minidom
  - In HTMLParser

- If web HTML is not always parseable, we need a different approach.
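
The contrast between a strict XML parser and a tolerant HTML parser can be sketched like this. It is an illustration, not the code under examples/parsing/: it assumes Python 3, where the `HTMLParser` class lives in `html.parser`, and the two sample documents and the helper names are invented for the sketch.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample documents: the second leaves <p> and <br> unclosed.
VALID = "<html><body><p>Squash curry</p></body></html>"
INVALID = "<html><body><p>Squash curry<br></body></html>"


class TagCollector(HTMLParser):
    """Record every start tag the event-based parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def strict_parse_ok(markup):
    """True if xml.dom.minidom accepts the markup as well-formed XML."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def lenient_tags(markup):
    """Start tags seen by HTMLParser, which does not insist on well-formedness."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

With these samples, `strict_parse_ok(INVALID)` is false because the XML parser stops at the mismatched closing tag, while `lenient_tags(INVALID)` still reports the tags it saw.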

Other ways to get information out of web pages?

- "squash" in page_contents.lower()

- re.search("squash", page_contents, re.IGNORECASE)

Inspirational quote: JWZ

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski

What’s wrong with regular expressions for scraping

- <a href="/whatever/">

- <a href='whatever'>

- <a href='whatever">

- Okay for “Reviews 1-10 of 430”

- Kodos: Regular expression GUI (since redemo.py seems unmaintained)
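
A small sketch of both halves of this slide: a regular expression is a reasonable tool for a fixed phrase such as “Reviews 1-10 of 430”, but a naive pattern for links misses the quoting variants listed above. The pattern and function names are illustrative assumptions, not code from the talk.

```python
import re

# Fine for a fixed, predictable phrase.
REVIEW_RANGE = re.compile(r"Reviews (\d+)-(\d+) of (\d+)")

# Naive: only understands double-quoted href values.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')


def parse_review_range(text):
    """Return (first, last, total) as ints, or None if the phrase is absent."""
    match = REVIEW_RANGE.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def naive_links(html):
    """Return the href values the naive pattern manages to find."""
    return NAIVE_HREF.findall(html)
```

Feeding `naive_links` the three anchor variants from the slide finds only the double-quoted one, which is the brittleness the slide is pointing at.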

Inspirational quote: Jon Postel

Robustness principle: “Be conservative in what you do, be liberal in what you accept from others.” – Jon Postel, Transmission Control Protocol, RFC 793

Inspirational quote: Leonard Richardson

“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.” – Leonard Richardson, author of BeautifulSoup

Back to curry

New goal for curry: Objectify

Map the menu to Python objects

- Play with the source in BeautifulSoup

- ...this is a text processing problem, not tag processing.

Model the data

examples/curry/menu.py

class Entree:

- index

- name

- description

- long winded description

- price
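
One way such a model might look is sketched below. The slides do not show the contents of examples/curry/menu.py, so everything here is an assumption: the `Entree` fields follow the slide (minus the long-winded description), while the one-line text layout `<index>. <name>: <description> $<price>` and the `parse_entree` helper are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Entree:
    index: int
    name: str
    description: str
    price: float


# Hypothetical layout of one menu line after the tags have been stripped:
#   "<index>. <name>: <description> $<price>"
ENTREE_LINE = re.compile(r"^\s*(\d+)\.\s*(.+?):\s*(.+?)\s*\$(\d+(?:\.\d{2})?)\s*$")


def parse_entree(line):
    """Turn one line of menu text into an Entree, or None if it does not fit."""
    match = ENTREE_LINE.match(line)
    if match is None:
        return None
    index, name, description, price = match.groups()
    return Entree(int(index), name.strip(), description.strip(), float(price))
```

Treating the menu as text first and objects second matches the earlier observation that this is a text processing problem rather than a tag processing one.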

Mini-lesson

- hand-written pages vs.

- machine-written pages

New goal: Scrape Yahoo! finance

- examples/tree-builders/beautifulsoup_yfinance.py

We’re done!

Right?

Trees of tags

What defines how HTML gets parsed?

Web browsers

Surfing tag trees in FireBug

- Or Opera Dragonfly

- Or Chrome’s Inspector

Parsing trees and finding elements

Early history

- 1998: HTML::TokeParser for Perl
  - $p->get_tag("title")

- 1999: W3C XPath standard
  - xmlDoc.selectNodes("//title")

- 2004: BeautifulSoup for Python, Release 1.0, “So rich and green”
  - soup("title")

- 2006: scrAPI for Ruby
  - CSS Selectors...
    - title
    - span.title

Recent history

- 2007: lxml.html improved, publicized by Ian Bicking
  - CSS selectors for Pythonistas

- 2007: html5lib: Parse web pages like a browser

- 2008: BeautifulSoup 3.1.0, the end of an era

- 2010: html5lib deprecates BeautifulSoup
  - “cannot correctly represent any HTML 5 tree (for lack of namespace support), and cannot represent at all any containing MathML or SVG”

Searching tag trees

- BeautifulSoup API (examples/tree-builders/beautifulsoup/search.py)

- html5lib creates BeautifulSoup objects (or others) (examples/tree-builders/html5lib/search.py)

- lxml provides XPath (examples/tree-builders/lxml/search_xpath.py)
  - “minimal stable XPath”

- lxml provides CSSSelect (examples/tree-builders/lxml/search_css.py)
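
To give a feel for path-based searching without installing anything, the sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and only well-formed markup. It is a stand-in for the lxml examples named above, not a copy of them; the sample document and the `find_text` helper are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
DOC = (
    "<html><head><title>Lunch menu</title></head>"
    '<body><span class="title">Squash curry</span>'
    '<span class="price">7.95</span></body></html>'
)


def find_text(markup, path):
    """Return the text of every element matching an ElementTree path."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


# Roughly the XPath //title
titles = find_text(DOC, ".//title")

# Roughly the CSS selector span.title
dish_names = find_text(DOC, ".//span[@class='title']")
```

Short paths that name a tag and one attribute tend to survive page redesigns better than long positional paths, which is one plausible reading of “minimal stable XPath”.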

Interacting with the web

Basic Yahoo! search (hard-coded)

examples/search/yahoo.py

Basic Google! search (hard-coded)

examples/search/google.py

- Great code, but broken due to ?

Something’s wrong...

A network trace of an HTTP conversation

User-Agent, and other headers the client sends

Status codes

- 2xx: Success

- 3xx: Redirection

- 4xx: Error
  - 402 Payment Required
  - 404 Not Found
  - 410 Gone
  - 418 I’m a teapot
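
A scraper usually branches on the class of the status code before it looks at the body. The sketch below shows one way to do that; the 1xx and 5xx classes are not on the slide and are added from the HTTP specification, and the function names are illustrative.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "Informational",  # not on the slide
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",   # not on the slide
}


def status_class(code):
    """Map a numeric status code to the name of its class."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


def describe(code):
    """Return 'code phrase (class)' using the standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase} ({status_class(code)})"
```

For example, `describe(410)` yields a string a log line could use to distinguish a page that is gone for good from one that is merely missing.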

HTTP methods

- GET

- POST

- PUT

- BREW

Once we set User-Agent, are we just like Firefox?

- JavaScript behavior

- Image download behavior

- Cookie behavior

- Invalid HTML handling behavior (?)

- Accept: headers

What if we settle for approximate emulation?

Re-do of Google search with a cooked user-agent

examples/search/urllib2-user-agent/google_as_ie.py

Favorite User-Agent headers

- Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)

- Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))

- I can’t believe it’s not Googlebot/2.1
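
Attaching one of these strings to a request can be sketched as follows. This assumes Python 3's `urllib.request` in place of `urllib2` and is not the code in the examples directory; `build_request` is a made-up helper, and nothing is sent over the network until the request is passed to `urlopen()`.

```python
import urllib.request

# First User-Agent string from the list above.
IE5_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)"


def build_request(url, user_agent=IE5_USER_AGENT):
    """Create (but do not send) a GET request with a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to `urllib.request.urlopen()` would then send the cooked header instead of the library's default one.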

HTTP: State via cookies

- HTTP itself is stateless, even though it runs on top of (stateful) TCP; cookies add state on top of HTTP
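
The mechanics are simple enough to sketch by hand: the server sends `Set-Cookie` headers, and the client echoes the name/value pairs back in a `Cookie` header on later requests. The header values below are invented, and `cookie_header` is a hypothetical helper built on the standard library's `http.cookies`.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values from a login response.
SET_COOKIE_VALUES = [
    "session=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]


def cookie_header(set_cookie_values):
    """Build the Cookie header a client would send back on its next request."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    # Only name=value pairs go back to the server; attributes such as Path stay local.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

The server recognizes the returning client purely by these echoed values, which is all the "state" amounts to.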

robots.txt

- User-agent: *

- Disallow: /

- Allow: /crawlme.html

- http://www.robotstxt.org/
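
Checking a URL against rules like these can be sketched with the standard library's `urllib.robotparser`. The `is_allowed` helper, the `MyScraper` agent name and the example.com URLs are assumptions for illustration. How an `Allow` line that follows a broader `Disallow` is resolved differs between parsers, so the sketch only checks a path that is clearly disallowed.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide above.
RULES = [
    "User-agent: *",
    "Disallow: /",
    "Allow: /crawlme.html",
]


def is_allowed(robots_lines, user_agent, url):
    """Ask the parser whether user_agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(RULES, "MyScraper", "http://example.com/private/page.html")
```

Here `blocked` is false, meaning a well-behaved robot would not fetch that page.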

robots.txt and detectability

- “How does the server know you’re a robot?”

- Well, if you GET /robots.txt...

Filling out more forms: POST and GET

(Be sure to pay attention to the clock; minute 90 is when snack break starts.)

POST: Cepstral Weather demo (by hand)

http://cepstral.com/cgi-bin/demos/weather

Note the URL we POST to

- from FireBug

Note the data we POST

- from FireBug

Write simple Python that also POSTs

examples/cepstral/just_post.py
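
In outline, POSTing from Python means URL-encoding the form fields and attaching them to the request as its body. The sketch below assumes Python 3's `urllib` and builds the request without sending it. The form field names `city` and `voice` are placeholders; the real names are whatever FireBug shows for the demo form.

```python
import urllib.parse
import urllib.request

# URL from the "Cepstral Weather demo" slide.
WEATHER_URL = "http://cepstral.com/cgi-bin/demos/weather"


def build_post(url, fields):
    """Create (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body)


# Placeholder field names and values, for illustration only.
request = build_post(WEATHER_URL, {"city": "Rochester", "voice": "Allison"})
```

Supplying `data` is what turns the request into a POST; passing `request` to `urllib.request.urlopen()` would submit the form.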

Pull out the .wav file and play it with mplayer

examples/cepstral/play_wav.py

POST: Cepstral weather demo (via mechanize)

examples/cepstral/just_post_via_mechanize.py

Basic Yahoo! search (via mechanize)

examples/search/yahoo_mechanize.py

- Great code, but broken due to robots.txt

Basic Yahoo! search (via mechanize, handle_robots=False)

examples/search/yahoo_mechanize_norobots.py

Basic Google! search (via mechanize, handle_robots=False, change user-agent)

examples/search/google_mechanize.py

Cookies

emusic: Log in and verify that we logged in successfully (with cookielib) (optional)

examples/cookies/emusic_login_byhand.py
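
The by-hand approach boils down to giving the URL opener a cookie jar, so that cookies set by the login response are replayed on later requests. A minimal sketch, assuming Python 3, where `cookielib` is named `http.cookiejar`; it is not the emusic example itself, and `make_session` is a made-up helper that performs no network access.

```python
import http.cookiejar
import urllib.request


def make_session():
    """Return an opener that stores and replays cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session()
```

Every request made through `opener.open()` shares the same jar, so a login POST followed by a GET of an account page behaves like one browser session.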

emusic: Log in and verify that we logged in successfully (with mechanize)

examples/cookies/emusic_login_mechanize.py

emusic: Check how many downloads we have left (with mechanize)

examples/cookies/emusic_check_downloads.py

Now we’re done, right?

Whew.

Recap

We’ve seen:

- Loading web pages from the network with urllib2

- Parsing web pages (even broken ones)

- Scraping that page into a set of structured Python objects

- HTTP status codes

- Faking the user agent header

- Submitting forms

- Keeping a session with cookies

“Play nice” on the web

- Ignore Terms of Service at your own peril

- robots.txt

Why scrape the web?

- Anger

- Interoperation with unmaintained systems

- “Rogue interoperability”

Web APIs


Facebook uses standards!

- XMPP chat doesn’t support:
  - grouping contacts
  - status messages
  - large profile images
  - notifications
- What’s the point?


“Sorry”

- Ohloh: “Sorry, it is not currently possible to get the list of commits through the API.”
- Flickr: No way to get a user avatar via the API.
- API keys are evidence of submission.
- Where is the love?
- Why even play this game?


Parser redux


Choosing a parser

- Performance
- Ease-of-use
- Quality
  - Especially as relates to cleaning broken HTML
  - HTML: 1998-style, or 2003-style?


Benchmarks by Ian Bicking

- Benchmarks run by me this morning
  - same results as Ian


Ease of use


Tree fixups

- lxml ≈ BeautifulSoup
- lxml ≈ html5lib
- BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0


A winner

- lxml!
- ...?


More about CSS selectors

- FireQuark
- http://www.imdb.com/title/tt0111161/
- h5:contains("Release")
- CSS...


Countermeasures


Easy


Imagine a really stupid bot


Check Referer header

- mechanize solves this


Extra hidden form fields


Requiring cookies
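
The easy countermeasures mostly come down to sending what a browser would send. The sketch below handles two of them by hand: it copies every hidden field out of the form's HTML and sets a Referer header pointing at the page the form came from. The class and function names and the example.com URLs are invented for illustration; cookies would be covered by opening the resulting request through a cookie-aware opener.

```python
from html.parser import HTMLParser
import urllib.parse
import urllib.request


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_post(form_page_url, form_html, action_url, visible_fields):
    """Build a POST request that echoes hidden fields and sets Referer."""
    collector = HiddenFieldCollector()
    collector.feed(form_html)
    fields = dict(collector.fields)
    fields.update(visible_fields)  # values a human would have typed
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        action_url, data=body, headers={"Referer": form_page_url})
```

The returned `Request` is not sent until it is passed to an opener's `open()` method.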


Countermeasures: hard


Per-IP address query limits

Example: Yahoo web search API

- Use more IPs
  - Tor, or
  - your own machines
- Use SOCKS (plus SSH) to make this easy


CAPTCHAs

Example: Google web search (when you exceed undeclared query limits).

- uh-oh


JavaScript

Example: “Hash cash” system for avoiding comment spam.

- uh-oh


Invisible countermeasures


Behavior profiling

- Time-based?


Inserting false links visible only to bots

- “Tarpits”


robots.txt access

- As soon as you access it, you lose.


Getting around IP address limits


Understand

- We still have to stay within the limits. We can just take advantage of IPs we do have.


ssh -D

- Borrow the IP of any machine you can log in to
- ssh -D 1080 asheesh.org


socks_monkey

- SOCKSify Python from within Python
- examples/ip-limits/socks_monkey.py


tsocks

- SOCKSify Python via LD_PRELOAD
- examples/ip-limits/tsocks/


tor

“The onion router”

- SOCKSify but borrow someone else’s IP
- (play nice...)


Cycling strategies

- Drain it dry
  - easy to implement first
- Round-robin
  - generally preferable
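
The two strategies can be sketched as generators that decide which proxy the next request should use. Everything named here is an assumption for illustration: the proxy labels stand in for local SOCKS ports (for example ones opened with ssh -D), and `per_proxy_limit` stands in for whatever query limit the site enforces per IP.

```python
import itertools

# Hypothetical local SOCKS endpoints, e.g. one per "ssh -D" tunnel.
PROXIES = ["localhost:1080", "localhost:1081", "localhost:1082"]


def drain_it_dry(proxies, per_proxy_limit):
    """Use each proxy until its quota is spent, then move to the next."""
    for proxy in proxies:
        for _ in range(per_proxy_limit):
            yield proxy


def round_robin(proxies):
    """Rotate through the proxies forever, one request each per turn."""
    return itertools.cycle(proxies)


# First four requests under each strategy.
drained = list(itertools.islice(drain_it_dry(PROXIES, 2), 4))
rotated = list(itertools.islice(round_robin(PROXIES), 4))
```

Round-robin spreads the load evenly over time, but it never stops by itself, so the caller still has to count requests per proxy to stay within each limit.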


Return to JavaScript: breaking Hash Cash


Detecting its presence

- Attempt to submit a comment with JS disabled
- Attempt to submit a comment with JS enabled
- Trace the second attempt in FireBug


Rewriting the JavaScript as Python

- You may think I’m joking, but this is a common strategy.
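
To make the idea concrete, here is what a ported computation might look like. The real plugin's JavaScript is not shown in these slides, so this is a generic hashcash-style proof of work and not its actual algorithm: the challenge format, the use of SHA-1, and the zero-prefix rule are all assumptions.

```python
import hashlib
import itertools


def solve_challenge(challenge, zero_digits=3):
    """Find a nonce whose SHA-1 digest starts with zero_digits hex zeros.

    Assumed scheme: digest = sha1("<challenge>:<nonce>"). A real site's
    rule has to be read out of its own JavaScript.
    """
    prefix = "0" * zero_digits
    for nonce in itertools.count():
        candidate = "{}:{}".format(challenge, nonce).encode("utf-8")
        digest = hashlib.sha1(candidate).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
```

The returned nonce would then go into whatever form field the page's script normally fills in.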


DOMForm

- Good news

“DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events.” – John J. Lee of mechanize

- Bad news

“This module is unmaintained. Maybe someday...” Also, it does not execute page-global JavaScript, which is where HashCash is implemented.


python-spidermonkey

- Good news
  - “Python/JavaScript bridge module, making use of Mozilla’s spidermonkey JavaScript implementation.”
- Bad news
  - ...do you really want to parse the web page for JavaScript and execute it?
- examples/javascript/hashcash.py


Ick

- None of this is as clean and automated as mechanize.


“Breaking” CAPTCHAs


Fallback: yourself

- Can always just prompt the operator to figure it out and enter it


Mailinator: “Enter these words to delete the email”

- Only so many different images
- So build a look-up table
  - ...indexed by URL?
  - ...indexed by image contents?
  - ...indexed by fuzzy image contents?

(I don’t have a good tool for the last one.)
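
A look-up table indexed by image contents can be as small as a dict keyed on a hash of the image bytes. This sketch assumes the same image is always served byte-for-byte identical, which is exactly what the fuzzy case breaks. The class, the `solve` helper, and the `ask_operator` callback are hypothetical names; the callback is the "fallback: yourself" step.

```python
import hashlib


class CaptchaLookup:
    """Map exact image bytes to a previously entered answer."""

    def __init__(self):
        self._answers = {}

    @staticmethod
    def _key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def remember(self, image_bytes, answer):
        self._answers[self._key(image_bytes)] = answer

    def lookup(self, image_bytes):
        """Return the stored answer, or None if this image is new."""
        return self._answers.get(self._key(image_bytes))


def solve(table, image_bytes, ask_operator):
    """Answer from the table if possible; otherwise ask a human once."""
    answer = table.lookup(image_bytes)
    if answer is None:
        answer = ask_operator(image_bytes)
        table.remember(image_bytes, answer)
    return answer
```

Indexing by URL would use the same structure with the URL string as the key in place of the digest.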


Audio captchas: “Simple” signal analysis

- Should be doable in pylab/matplotlib with fast Fourier transforms


JavaScript CAPTCHAs (like reCAPTCHA)

- re-implement CAPTCHA-downloading logic in Python
- ...or execute the JavaScript with spidermonkey


...JDownloader

- “Again, our captcha team did a great job and implemented many new captcha methods.”


The website from Hell: US PTO Public PAIR

http://portal.uspto.gov/external/portal/pair


Start with a CAPTCHA


Solve it and move on to...

- document.write()


The page is invisible.

Automating the web browser

Selenium Remote Control

examples/seleniumrc/start.py


Selenium IDE

- Our friend, XPath
- FireBug


Why don’t we just do this all the time?

- Firefox memory footprint
- Flexibility


Other tricks


Your parser may fail


Text encoding

- Look in the HTTP header!
- Try UTF-8!
- ...chardet, if you must
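
That order of attempts can be sketched with the standard library alone. The function names are invented for illustration, and the final Latin-1 step is only a stand-in assumption for the place where chardet (a third-party package) would be called; Latin-1 never fails to decode, so it may produce wrong characters where it should raise an error.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None


def decode_body(body, content_type=None):
    """Decode bytes: declared charset first, then UTF-8, then a fallback."""
    candidates = []
    if content_type:
        declared = charset_from_content_type(content_type)
        if declared:
            candidates.append(declared)
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next one
    # Stand-in for a chardet guess; this is an assumption, not a detector.
    return body.decode("latin-1")
```

`LookupError` is caught because servers sometimes declare charset names Python does not recognize.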


Automatically reverse-engineer templates

- templatemaker by Adrian Holovaty
- everyblock templatemaker


table2dict

- Python bug tracker
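
The talk's own table2dict is not reproduced in these slides, so the following is a guess at the idea and not the author's code: treat the first row of an HTML table as column names and turn every later row into a dict. It uses the standard library's HTMLParser and assumes a simple table with no nested tables or rowspan/colspan.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the text of each cell, grouped into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def table2dict(html):
    """Return one dict per data row, keyed by the header row's cells."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

For a bug tracker listing, each resulting dict would be one issue, with keys taken from the column headings.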


Conclusions


Scaling and stability

- Choosing reliable queries from web pages
- Expanding to more IP addresses when necessary using SSH (and Python 2.6 multiprocessing for a plausible model of how to rotate SOCKS proxies)
- Tor (and other proxy considerations)
- registrar.py: was seven years stable...


Summary

- If it’s on a web page, you can scrape it out.
- “Now you have an API for everything.”


Future directions

- More automation
- Using cssselect everywhere, geez it’s cool


Bonus time

If we have time:

- Greasemonkey demo: scraping in the browser
- Audience-suggested scraping lab
- Workshopping on queries or regular expressions
