![Page 1: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/1.jpg)
Scrape the Web: Strategies for programming websites that don’t expect it
Presenter: Asheesh Laroia, @asheeshlaroia ([email protected], +1-585-506-8865)
February 18, 2010
![Page 2: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/2.jpg)
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
![Page 3: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/3.jpg)
Intro
![Page 4: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/4.jpg)
Meta
![Page 5: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/5.jpg)
Hello
- You will learn neat tricks
- DO NOT BECOME AN EVIL COMMENT SPAMMER
- Theory, practice, and iterative development
- Brittle? Sometimes.
- The comics aren’t mine; ask me for references.
![Page 11: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/11.jpg)
Format introduction
- I’ll stand up here and talk about things.
- You’ll ask me questions.
![Page 14: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/14.jpg)
You know what sucks?
- It sucks when everyone’s thinking something and nobody’s saying it.
- If I am incoherent, stop me.
![Page 17: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/17.jpg)
“Only” three hours
- Slow me down,
- or speed me up.
- Do this with your voice or by raising your hand.
- Don’t try to do it via Twitter.
![Page 22: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/22.jpg)
What is screen scraping?
![Page 23: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/23.jpg)
Photo
![Page 24: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/24.jpg)
Photo
![Page 25: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/25.jpg)
Brittle?
![Page 26: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/26.jpg)
Remote procedure call
- Every time you press a key, you cause the remote computer to execute code.
- Every keypress causes a remote procedure call.
- If you understand this, you can document it as an API.
![Page 30: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/30.jpg)
Power
- We get to interact with the raw data.
- We could write our own interface.
- We get to programmatically interact with a system that only expects humans at the door.
![Page 34: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/34.jpg)
Independence
- Design choices and restrictions fall away.
![Page 36: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/36.jpg)
Power, too much
- WE CAN SEND SPAM!
- Don’t do that.
![Page 40: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/40.jpg)
Programming the web
![Page 41: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/41.jpg)
Say
![Page 42: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/42.jpg)
The Web
- It’s the twenty-first century.
- The Web is a massive, mostly-unrestricted remote procedure call system.
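One way to picture the "remote procedure call" framing is the sketch below: the URL path plays the role of the procedure name and the query string carries the arguments. The `build_call` helper, the `example.com` endpoint, and the `text` and `voice` parameter names are all made up for illustration. Nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_call(base_url, **params):
    """Describe a 'remote procedure call' as an HTTP GET request.

    The procedure name is the URL path; the arguments are the query string.
    Nothing is sent here; pass the result to urllib.request.urlopen()
    to actually make the call.
    """
    # Sort the arguments so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return Request(base_url + "?" + query, method="GET")


# Hypothetical endpoint and parameter names, for illustration only.
call = build_call("http://example.com/say", text="hello world", voice="demo")
print(call.get_method(), call.full_url)
```

Seen this way, documenting a site as an API mostly means writing down which URLs exist and which parameters each one accepts.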
![Page 45: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/45.jpg)
Mac OS “say”
- I’m not hip enough to have “say”
- but I do have the Web
![Page 48: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/48.jpg)
Cepstral demo
![Page 49: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/49.jpg)
Curry
![Page 50: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/50.jpg)
Delicious
![Page 51: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/51.jpg)
Curry on the web
http://mehfilindian.com/LunchMenuTakeOut.htm
![Page 52: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/52.jpg)
Beneath the covers...
- FrontPage 6.0 is from 2003
- Some really ugly HTML...
- I like to call this 1998-style HTML
![Page 56: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/56.jpg)
The easy way
examples/curry/trivial.py
- urllib2.urlopen() gives you a file-like object
- Now you can read() it... (and you get a big ol’ byte string)
- Test its contents for squash, and you’re done.
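The contents of examples/curry/trivial.py are not reproduced here, so the following is only a guess at the shape of the idea, not the talk's actual file. It assumes Python 3, where `urllib.request.urlopen()` is the successor to `urllib2.urlopen()`. The helper names `mentions_squash` and `squash_on_menu` are invented for this sketch.

```python
from urllib.request import urlopen

# The lunch menu URL shown on the earlier slide.
MENU_URL = "http://mehfilindian.com/LunchMenuTakeOut.htm"


def mentions_squash(page_bytes):
    """Return True if the raw page bytes mention squash, ignoring case."""
    return b"squash" in page_bytes.lower()


def squash_on_menu(url=MENU_URL):
    """Fetch the page and test its contents; this needs network access."""
    with urlopen(url) as response:  # a file-like object
        return mentions_squash(response.read())


# The check itself needs no network, so it can be tried on canned bytes.
print(mentions_squash(b"<td>Butternut SQUASH curry</td>"))
```

Calling `squash_on_menu()` performs the real fetch. Note that this approach never looks at the page structure; it only searches the byte string.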
![Page 60: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/60.jpg)
The Web and standards
- We don’t have to resort to visual screen scraping.
- The web has a standard data format for marking up page content.
- What is it called?
![Page 64: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/64.jpg)
XHTML and HTML
- It’s 2010.
- Surely XHTML has won by now.
![Page 67: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/67.jpg)
“Extract some information”
- HTML
- vs. XHTML (2000)
- Both are trees of tags; both can be visualized in Firebug.
- ...did XHTML win?
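To make "trees of tags" concrete without a browser, here is a minimal sketch that prints the element tree of a small, well-formed XHTML fragment using `xml.dom.minidom`. The fragment and the `tag_tree` helper are made up for this example. It only works on well-formed input.

```python
from xml.dom import minidom

# A tiny, well-formed XHTML fragment invented for this example.
PAGE = (
    "<html><head><title>Menu</title></head>"
    "<body><p>Squash <b>curry</b></p></body></html>"
)


def tag_tree(node, depth=0):
    """Return indented element names, one per line, in document order."""
    lines = []
    for child in node.childNodes:
        # Skip text nodes; only elements become lines in the outline.
        if child.nodeType == child.ELEMENT_NODE:
            lines.append("  " * depth + child.tagName)
            lines.extend(tag_tree(child, depth + 1))
    return lines


print("\n".join(tag_tree(minidom.parseString(PAGE))))
```

The indentation mirrors the nesting that a tool like Firebug shows as a collapsible tree.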
![Page 73: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/73.jpg)
Stats pop quiz
(Stats from the MAMA survey published by Opera <http://dev.opera.com/articles/view/mama-key-findings/>.)
- Average page size?
  - 16.5K
- HTML to XHTML ratio?
  - 2:1
- Transitional vs. Strict/Frameset:
  - 10:1
- How many in “Quirks” mode?
  - 85%
- What’s more popular? TITLE or BODY?
  - TITLE
- What percent validate in general?
  - ca. 4.13%
- What percent of web pages that have validation badges validate?
  - ca. 12
![Page 89: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/89.jpg)
The web: Round one
![Page 90: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/90.jpg)
Parsing considerations
![Page 91: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/91.jpg)
A showcase of some of your options
I An example of valid HTML (written by hand) (examples/parsing/)
  I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand) (examples/parsing/invalid-html/)
  I Parsed with HTMLParser
I An example of valid XHTML (written by hand) (examples/parsing/valid-xhtml/)
  I Parsed with xml.dom.minidom
  I Parsed with HTMLParser
I An example of invalid XHTML <http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm> (examples/parsing/invalid-xhtml/)
  I In Firefox
  I In xml.dom.minidom
  I In HTMLParser
I If web HTML is not always parseable, we need a different approach.
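A minimal sketch of the comparison above, assuming Python 3, where the HTMLParser module named on the slide lives at `html.parser`. The two markup strings and the helper names `html_tags` and `parses_as_xml` are made up for illustration; they are not the files under examples/parsing/.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Feed markup to the event-based HTML parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def parses_as_xml(markup):
    """Return True if the strict XML parser behind minidom accepts the markup."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


# Hypothetical inputs: the second one leaves <br> unclosed, which XML forbids.
VALID_XHTML = "<html><body><p>Squash curry</p></body></html>"
INVALID_XHTML = "<html><body><p>Squash curry<br></body></html>"

print(html_tags(INVALID_XHTML))
print(parses_as_xml(VALID_XHTML), parses_as_xml(INVALID_XHTML))
```

The event-based parser reports tags even for the broken input, while the XML parser rejects it outright, which is the gap the last bullet points at.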
![Page 104: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/104.jpg)
Other ways to get information out of web pages?
I "squash" in page_contents.lower()
I re.search("squash", page_contents, re.IGNORECASE)
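Both one-liners can be tried on a small string. This is only a sketch: the `page_contents` value is a made-up stand-in for a downloaded page.

```python
import re

# Hypothetical page source; in practice this would be the downloaded HTML.
page_contents = (
    "<html><body><h1>Today's special: Butternut SQUASH curry</h1></body></html>"
)

# Substring test: lowercase the page so the comparison ignores case.
found_by_substring = "squash" in page_contents.lower()

# Regular expression test: re.IGNORECASE does the case folding instead.
match = re.search("squash", page_contents, re.IGNORECASE)
found_by_regex = match is not None

print(found_by_substring, found_by_regex)
```

Neither test knows anything about tags, so both would also fire on "squash" inside an attribute or a comment.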
![Page 107: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/107.jpg)
Inspirational quote: JWZ
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
– Jamie Zawinski
![Page 108: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/108.jpg)
What’s wrong with regular expressions for scraping
I <a href="/whatever/">
I <a href='whatever'>
I <a href='whatever">
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seems unmaintained)
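A small sketch of why the link variants above are a problem while the review counter is not. The two patterns are illustrative assumptions, not code from the talk, and the third link is assumed to have mismatched quotes.

```python
import re

# A naive pattern that only expects double-quoted href values.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')

# A pattern for the fixed, machine-generated phrase from the slide.
REVIEW_COUNT = re.compile(r"Reviews (\d+)-(\d+) of (\d+)")

links = [
    '<a href="/whatever/">',  # double quotes
    "<a href='whatever'>",    # single quotes
    "<a href='whatever\">",   # mismatched quotes (assumed reading of the slide)
]

# Which of the three link spellings does the naive pattern recognize?
matched = [NAIVE_HREF.search(link) is not None for link in links]

# The counter text has one predictable shape, so a regex handles it well.
first, last, total = REVIEW_COUNT.search("Reviews 1-10 of 430").groups()

print(matched)
print(first, last, total)
```

Only the first link matches. Every extra quoting style needs another branch in the pattern, whereas the counter keeps working as long as the phrase keeps its shape.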
![Page 114: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/114.jpg)
Inspirational quote: Jon Postel
Robustness principle: “Be conservative in what you do, be liberal in what you accept from others.”
– Jon Postel, Transmission Control Protocol, RFC 793
![Page 115: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/115.jpg)
Inspirational quote: Leonard Richardson
“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.”
– Leonard Richardson, author of BeautifulSoup
![Page 116: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/116.jpg)
Back to curry
![Page 117: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/117.jpg)
New goal for curry: Objectify
Map the menu to Python objects
I play with the source in BeautifulSoup
I ...this is a text processing problem, not tag processing.
![Page 120: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/120.jpg)
Model the data
examples/curry/menu.py
class Entree:
I index
I name
I description
I long_winded_description
I price
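One way the model above might look in code, as a sketch rather than the contents of examples/curry/menu.py. The field types, the `parse_entree` helper, the sample menu line, and the "index. name (description) $price" layout are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Entree:
    index: int
    name: str
    description: str
    long_winded_description: str
    price: Decimal


# Hypothetical layout of one menu line: "12. Name (short description) $8.95"
MENU_LINE = re.compile(r"^(\d+)\.\s+(.+?)\s+\((.*?)\)\s+\$(\d+\.\d{2})$")


def parse_entree(line, long_winded_description=""):
    """Turn one line of menu text into an Entree, or raise ValueError."""
    match = MENU_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized menu line: %r" % line)
    index, name, description, price = match.groups()
    return Entree(
        int(index), name, description, long_winded_description, Decimal(price)
    )


entree = parse_entree("12. Aloo Gobi (potatoes and cauliflower) $8.95")
print(entree)
```

The parsing step works on plain text pulled out of the page, which fits the earlier observation that this menu is a text processing problem rather than a tag processing one.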
![Page 126: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/126.jpg)
Mini-lesson
I hand-written pages vs.
I machine-written pages
![Page 129: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/129.jpg)
New goal: Scrape Yahoo! finance
I examples/tree-builders/beautifulsoup_yfinance.py
![Page 131: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/131.jpg)
We’re done!
Right?
![Page 132: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/132.jpg)
Trees of tags
![Page 133: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/133.jpg)
What defines how HTML gets parsed?
Web browsers
![Page 134: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/134.jpg)
Surfing tag trees in Firebug
I Or Opera Dragonfly
I Or Chrome’s Inspector
![Page 137: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/137.jpg)
Parsing trees and finding elements
![Page 138: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/138.jpg)
Early history
I 1998: HTML::TokeParser for Perl
  I $p->get_tag("title")
I 1999: W3C XPath standard
  I xmlDoc.selectNodes("//title")
I 2004: BeautifulSoup for Python, Release 1.0, “So rich and green”
  I soup("title")
I 2006: scrAPI for Ruby
  I CSS Selectors...
    I title
    I span.title
![Page 149: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/149.jpg)
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
  I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
  I “cannot correctly represent any HTML 5 tree (for lack of namespace support), and cannot represent at all any containing MathML or SVG”
![Page 156: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/156.jpg)
Searching tag trees
I BeautifulSoup API (examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others) (examples/tree-builders/html5lib/search.py)
I lxml provides XPath (examples/tree-builders/lxml/search_xpath.py)
  I “minimal stable XPath”
I lxml provides CSSSelect (examples/tree-builders/lxml/search_css.py)
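The libraries above are third-party, so this sketch swaps in the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and only accepts well-formed markup. It is a stand-in to show the shape of a tree search, not the code in the example files, and the document string is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page; real-world HTML usually needs a lenient parser.
DOCUMENT = """<html>
  <head><title>Menu</title></head>
  <body>
    <span class="title">Squash curry</span>
    <span class="price">8.95</span>
  </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Rough equivalent of the XPath //title: any <title> below the root element.
page_title = root.findtext(".//title")

# Rough equivalent of the CSS selector span.title. Note that this predicate
# compares the whole class attribute, while CSS matches one class among many.
span_titles = [span.text for span in root.findall(".//span[@class='title']")]

print(page_title)
print(span_titles)
```

Short queries like these depend on little of the page layout, which is one reading of the “minimal stable XPath” bullet.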
![Page 162: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/162.jpg)
Interacting with the web
![Page 163: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/163.jpg)
Basic Yahoo! search (hard-coded)
examples/search/yahoo.py
![Page 164: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/164.jpg)
Basic Google! search (hard-coded)
examples/search/google.py
I Great code, but broken due to ?
![Page 166: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/166.jpg)
Something’s wrong...
![Page 168: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/168.jpg)
The web: HTTP and you
![Page 169: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/169.jpg)
A network trace of an HTTP conversation
![Page 170: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/170.jpg)
User-Agent, and other headers the client sends
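The trace shown on the slide is not reproduced in this text, so here is a minimal sketch of how one could capture a similar conversation without touching a real website. It starts a throwaway server on 127.0.0.1 and sends a hand-written request over a plain socket; the handler class, the `hand-typed/0.1` User-Agent, and the function name are all invented for this example.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the trace needs no real website."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo server quiet


def trace_conversation():
    """Send one hand-written GET; return (request_text, response_text)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    # Request line, then headers the client chooses to send, then a blank line.
    request = (
        "GET / HTTP/1.0\r\n"
        "Host: %s:%d\r\n"
        "User-Agent: hand-typed/0.1\r\n"
        "Accept: text/html\r\n"
        "\r\n" % (host, port)
    )
    try:
        with socket.create_connection((host, port)) as sock:
            sock.sendall(request.encode("ascii"))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection: response complete
                    break
                chunks.append(data)
    finally:
        server.shutdown()
        server.server_close()
    return request, b"".join(chunks).decode("latin-1")


if __name__ == "__main__":
    sent, received = trace_conversation()
    print(sent)
    print(received)
```

The printed response begins with a status line, followed by the server's own headers, a blank line, and the body.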
![Page 171: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/171.jpg)
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Client error
I 402 Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
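As a rough illustration, the first digit of a status code gives its family, and Python 3's `http.HTTPStatus` supplies the standard reason phrases. The `describe` helper and the `FAMILIES` table are invented here and cover only the families listed on the slide; whether 418 appears in `HTTPStatus` depends on the Python version, so the sketch tolerates its absence.

```python
from http import HTTPStatus

# Only the families named on the slide; anything else is reported as "Other".
FAMILIES = {2: "Success", 3: "Redirection", 4: "Client error"}


def describe(code):
    """Return 'CODE PHRASE [FAMILY]' for a numeric HTTP status code."""
    family = FAMILIES.get(code // 100, "Other")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Older Pythons do not know every code (418 is one example).
        phrase = "(unknown to this Python)"
    return "%d %s [%s]" % (code, phrase, family)


if __name__ == "__main__":
    for code in (200, 301, 402, 404, 410, 418):
        print(describe(code))
```

When fetching with `urllib.request`, a 4xx response surfaces as `urllib.error.HTTPError`; its `code` attribute can be passed to a helper like this one.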
![Page 179: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/179.jpg)
HTTP methods
I GET
I POST
I PUT
I BREW
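A small sketch of how the method is chosen when building requests with Python 3's `urllib.request` (the successor to urllib2): a request without a body defaults to GET, one with a body defaults to POST, and any other verb can be named explicitly. Nothing is sent over the network; the URLs are placeholders. BREW comes from the April Fools' coffee-pot protocol (RFC 2324), so no real server is expected to honor it.

```python
from urllib.request import Request

URL = "http://example.com/form"  # placeholder; these requests are never sent

get_req = Request(URL)                                       # no body: GET
post_req = Request(URL, data=b"q=scraping")                  # body present: POST
put_req = Request(URL, data=b"new contents", method="PUT")   # explicit verb
brew_req = Request("http://example.com/pot-1", method="BREW")  # joke verb

if __name__ == "__main__":
    for req in (get_req, post_req, put_req, brew_req):
        print(req.get_method(), req.full_url)
```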
![Page 184: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/184.jpg)
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
![Page 190: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/190.jpg)
What if we settle for approximate emulation?
![Page 191: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/191.jpg)
Re-do of Google search with a cooked user-agent
examples/search/urllib2-user-agent/google_as_ie.py
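The talk's script is not reproduced here, so what follows is only a sketch of the idea, written for Python 3's `urllib.request` rather than urllib2 and pointed at a local echo server instead of Google. The server class and function name are invented; the cooked string is one of the deck's favorite User-Agent headers. The server simply reports back which User-Agent it received, first for a default request and then for one with the header overridden.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener

COOKED_UA = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)"


class EchoUserAgent(BaseHTTPRequestHandler):
    """Replies with whatever User-Agent header the client sent."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo server quiet


def user_agents_seen_by_server():
    """Return (default_ua, cooked_ua) as observed on the server side."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    # Empty ProxyHandler: ignore proxy environment variables for this local demo.
    opener = build_opener(ProxyHandler({}))
    try:
        with opener.open(url) as resp:
            default_ua = resp.read().decode("utf-8")
        cooked = Request(url, headers={"User-Agent": COOKED_UA})
        with opener.open(cooked) as resp:
            cooked_ua = resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return default_ua, cooked_ua


if __name__ == "__main__":
    for ua in user_agents_seen_by_server():
        print(ua)
```

The default announces the Python library by name, which is exactly the kind of giveaway a server can filter on.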
![Page 192: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/192.jpg)
Favorite User-Agent headers
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))
I I can’t believe it’s not Googlebot/2.1
![Page 196: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/196.jpg)
HTTP: State via cookies
I HTTP itself is stateless, even though it runs on top of stateful TCP; cookies add the state back
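To make the cookie mechanism concrete, here is a sketch using Python 3's `http.cookiejar` (the renamed cookielib) against a local server. The server counts visits purely from the cookie the client sends back; the cookie name `visits`, the handler, and `fetch_twice` are invented for this example.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class VisitCounter(BaseHTTPRequestHandler):
    """Knows a returning client only through the Cookie header it sends."""

    def do_GET(self):
        visits = 0
        # Parse a header such as "visits=3" by hand (hypothetical cookie name).
        for part in self.headers.get("Cookie", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name == "visits" and value.isdigit():
                visits = int(value)
        visits += 1
        body = ("visit number %d" % visits).encode("ascii")
        self.send_response(200)
        self.send_header("Set-Cookie", "visits=%d" % visits)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo server quiet


def fetch_twice(with_cookies):
    """Fetch the page twice with one opener; return both response bodies."""
    server = HTTPServer(("127.0.0.1", 0), VisitCounter)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    handlers = [ProxyHandler({})]  # ignore proxy settings for the local demo
    if with_cookies:
        # The jar stores Set-Cookie values and replays them on later requests.
        handlers.append(HTTPCookieProcessor(CookieJar()))
    opener = build_opener(*handlers)
    bodies = []
    try:
        for _ in range(2):
            with opener.open(url) as resp:
                bodies.append(resp.read().decode("ascii"))
    finally:
        server.shutdown()
        server.server_close()
    return bodies


if __name__ == "__main__":
    print("without a cookie jar:", fetch_twice(False))
    print("with a cookie jar:   ", fetch_twice(True))
```

Without the jar, every request looks like a first visit; with it, the second request carries the cookie and the server can tell the client has been there before.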
![Page 198: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/198.jpg)
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
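A sketch of checking such rules with the standard library's `urllib.robotparser`, feeding it the lines directly instead of fetching a file. The bot name, the example.com host, and the `/secret.html` path are placeholders. Rule order is worth a look: Python's parser has historically applied the first matching rule in file order, while some crawlers pick the most specific match, so the slide's ordering (Disallow before Allow) may give different answers in different tools. The second rule list puts the Allow line first to sidestep that ambiguity.

```python
from urllib.robotparser import RobotFileParser

# Rules exactly as listed on the slide.
SLIDE_RULES = ["User-agent: *", "Disallow: /", "Allow: /crawlme.html"]
# Same rules with the more specific Allow line first.
ALLOW_FIRST = ["User-agent: *", "Allow: /crawlme.html", "Disallow: /"]


def allowed(rules, path, agent="ExampleBot"):
    """Ask the parser whether agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines; no network access
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for label, rules in (("slide order", SLIDE_RULES), ("allow first", ALLOW_FIRST)):
        for path in ("/crawlme.html", "/secret.html"):
            print(label, path, allowed(rules, path))
```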
![Page 203: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/203.jpg)
robots.txt and detectability
I “How does the server know you’re a robot?”
I Well, if you GET /robots.txt...
![Page 206: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/206.jpg)
Filling out more forms: POST and GET
(Be sure to pay attention to the clock; minute 90 is when the snack break starts.)
![Page 207: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/207.jpg)
POST: Cepstral Weather demo (by hand)
http://cepstral.com/cgi-bin/demos/weather
![Page 208: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/208.jpg)
Note the URL we POST to
I from Firebug
![Page 210: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/210.jpg)
Note the data we POST
I from Firebug
![Page 212: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/212.jpg)
Write simple Python that also POSTs
examples/cepstral/just_post.py
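The Cepstral script and the form's real field names are not shown in this text, so the sketch below only demonstrates the general shape of a form POST with Python 3's `urllib`: URL-encode the fields, pass them as `data`, and read the reply. It talks to a local echo server; the field names `city` and `voice`, the handler, and `post_form` are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode
from urllib.request import ProxyHandler, Request, build_opener


class EchoForm(BaseHTTPRequestHandler):
    """Decodes a urlencoded POST body and sends it back as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_qs(self.rfile.read(length).decode("ascii"))
        body = json.dumps(fields, sort_keys=True).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo server quiet


def post_form(fields):
    """POST fields as a form; return what the server says it received."""
    server = HTTPServer(("127.0.0.1", 0), EchoForm)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/submit" % server.server_address[1]
    # Supplying data= is what turns the request into a POST.
    request = Request(url, data=urlencode(fields).encode("ascii"))
    opener = build_opener(ProxyHandler({}))  # skip proxies for the local demo
    try:
        with opener.open(request) as resp:
            return json.loads(resp.read().decode("ascii"))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(post_form({"city": "Boston", "voice": "demo"}))
```

Against a real site, the URL and field names would come from what Firebug shows for the browser's own submission.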
![Page 213: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/213.jpg)
Pull out the .wav file and play it with mplayer
examples/cepstral/play_wav.py
![Page 214: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/214.jpg)
POST: Cepstral weather demo (via mechanize)
examples/cepstral/just_post_via_mechanize.py
![Page 215: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/215.jpg)
Basic Yahoo! search (via mechanize)
examples/search/yahoo_mechanize.py
I Great code, but broken due to robots.txt
![Page 217: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/217.jpg)
Basic Yahoo! search (via mechanize, handle_robots=False)
examples/search/yahoo_mechanize_norobots.py
![Page 218: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/218.jpg)
Basic Google! search (via mechanize, handle_robots=False, change user-agent)
examples/search/google_mechanize.py
![Page 219: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/219.jpg)
Cookies
![Page 220: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/220.jpg)
emusic: Log in and verify that we logged in successfully (with cookielib) (optional)
examples/cookies/emusic_login_byhand.py
![Page 221: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/221.jpg)
emusic: Log in and verify that we logged in successfully (with mechanize)
examples/cookies/emusic_login_mechanize.py
![Page 222: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/222.jpg)
emusic: Check how many downloads we have left (with mechanize)
examples/cookies/emusic_check_downloads.py
![Page 223: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/223.jpg)
Now we’re done, right?
Whew.
![Page 225: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/225.jpg)
Recap and philosophy
![Page 226: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/226.jpg)
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
![Page 234: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/234.jpg)
“Play nice” on the web
I Ignore Terms of Service at your own peril
I robots.txt
I DO NOT BECOME AN EVIL COMMENT SPAMMER
![Page 238: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/238.jpg)
Why scrape the web?
I Anger
I Interoperation with unmaintained systems
I “Rogue interoperability”
![Page 242: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/242.jpg)
Web APIs
![Page 243: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/243.jpg)
Facebook uses standards!
I XMPP chat doesn’t support:
I grouping contacts
I status messages
I large profile images
I notifications
I What’s the point?
![Page 250: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/250.jpg)
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list of commits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
![Page 257: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/257.jpg)
Parser redux
![Page 258: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/258.jpg)
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTML
I HTML: 1998-style, or 2003-style?
![Page 264: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/264.jpg)
Benchmarks by Ian Bicking
I Benchmarks run by me this morning
I same results as Ian
![Page 267: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/267.jpg)
Ease of use
![Page 268: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/268.jpg)
Tree fixups
I lxml ≈ BeautifulSoup
I lxml ≈ html5lib
I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0
![Page 272: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/272.jpg)
A winner
I lxml!
I ...?
![Page 275: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/275.jpg)
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains("Release")
I CSS...
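The selector on this slide picks every h5 element whose text includes the word "Release". As a rough illustration of that matching rule, here is a standard-library sketch that uses html.parser as a stand-in for a real selector engine. The class name ContainsFinder and the sample markup are made up for this example and are not the real page's HTML.

```python
from html.parser import HTMLParser


class ContainsFinder(HTMLParser):
    """Collect the text of every <tag> element whose text contains `needle`."""

    def __init__(self, tag, needle):
        super().__init__()
        self.tag = tag
        self.needle = needle
        self.matches = []
        self._buffer = None  # text chunks collected while inside a target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if self.needle in text:
                self.matches.append(text)
            self._buffer = None


# Hypothetical markup, loosely shaped like a film detail page.
SAMPLE = """
<div><h5>Director:</h5><p>Someone</p></div>
<div><h5>Release Date:</h5><p>some date</p></div>
"""

finder = ContainsFinder("h5", "Release")
finder.feed(SAMPLE)
print(finder.matches)
```

A selector engine does the same job in one expression, which is the appeal of CSS selectors over hand-written tree walking.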
![Page 281: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/281.jpg)
Countermeasures
![Page 282: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/282.jpg)
Easy
![Page 283: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/283.jpg)
Imagine a really stupid bot
![Page 284: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/284.jpg)
Check Referer header
I mechanize solves this
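For comparison, this is roughly what setting the header by hand looks like without mechanize. It is a minimal sketch using urllib.request from the standard library. The function name and both URLs are placeholders invented for this example.

```python
import urllib.request


def request_with_referer(url, referer):
    """Build a request that claims to have been reached from `referer`."""
    req = urllib.request.Request(url)
    req.add_header("Referer", referer)
    return req


# Hypothetical URLs, for illustration only.
req = request_with_referer(
    "http://example.com/download", "http://example.com/listing"
)
print(req.get_header("Referer"))
# Sending it would be: urllib.request.urlopen(req)
```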
![Page 286: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/286.jpg)
Extra hidden form fields
I mechanize solves this
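Done by hand, the idea is to read the form first and send back every hidden field along with your own values. Here is a minimal standard-library sketch. The form markup, the field name "token" and its value are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Record name/value pairs of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


# Hypothetical form; the token name and value are made up.
FORM = """
<form action="/comment" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="body">
</form>
"""

collector = HiddenFieldCollector()
collector.feed(FORM)

# Submit the hidden fields unchanged, plus the field we actually care about.
data = dict(collector.fields, body="hello")
print(urlencode(data))
```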
![Page 288: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/288.jpg)
Requiring cookies
I mechanize solves this
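Without mechanize, the standard library can keep cookies across requests too. The sketch below assumes plain urllib.request plus http.cookiejar. The helper name make_cookie_session is invented here, and no request is actually sent.

```python
import http.cookiejar
import urllib.request


def make_cookie_session():
    """Return (opener, jar); the opener stores and replays cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = make_cookie_session()
# opener.open("http://example.com/login") would record any Set-Cookie
# headers in `jar`, and later opener.open(...) calls would send them back.
print(len(jar))  # no requests made yet, so the jar is empty
```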
![Page 290: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/290.jpg)
Countermeasures: hard
![Page 291: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/291.jpg)
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, or
I your own machines
I Use SOCKS (plus SSH) to make this easy
![Page 296: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/296.jpg)
CAPTCHAs
Example: Google web search (when you exceed undeclared query limits).
I uh-oh
![Page 298: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/298.jpg)
JavaScript
Example: “Hash cash” system for avoiding comment spam.
I uh-oh
![Page 300: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/300.jpg)
Invisible countermeasures
![Page 301: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/301.jpg)
Behavior profiling
I Time-based?
![Page 303: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/303.jpg)
Inserting a false link visible only to bots
I “Tarpits”
![Page 305: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/305.jpg)
robots.txt access
I As soon as you access it, you lose.
![Page 307: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/307.jpg)
Getting around IP address limits
![Page 308: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/308.jpg)
Understand
I We still have to stay within the limits. We can just take advantage of IPs we do have.
![Page 310: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/310.jpg)
ssh -D
I Borrow the IP of any machine you can log in to
I ssh -D 1080 asheesh.org
![Page 313: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/313.jpg)
socks_monkey
I SOCKSify Python from within Python
I examples/ip-limits/socks_monkey.py
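The example file named on this slide is not reproduced in these notes, so what follows is only a guess at the general approach: open a SOCKS proxy with ssh -D, then replace socket.socket so every new connection goes through it. It assumes the third-party PySocks package (module name socks), which is imported lazily so the sketch loads without it. The function names are invented for this example.

```python
import socket


def ssh_socks_command(host, port=1080):
    """Command line that opens a local SOCKS proxy through `host`."""
    # -D: dynamic (SOCKS) forwarding on the local port; -N: no remote command.
    return ["ssh", "-N", "-D", str(port), host]


def socksify(proxy_host="localhost", proxy_port=1080):
    """Monkey-patch socket.socket so new connections use the SOCKS proxy.

    Assumes the third-party PySocks package (module name `socks`) is
    installed; it is imported here so this file loads without it.
    """
    import socks

    socks.set_default_proxy(socks.SOCKS5, proxy_host, proxy_port)
    socket.socket = socks.socksocket


print(" ".join(ssh_socks_command("asheesh.org")))
```

Calling socksify() once at startup, before any HTTP library opens a connection, is what makes the patch apply to the whole process.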
![Page 316: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/316.jpg)
tsocks
I SOCKSify Python via LD_PRELOAD
I examples/ip-limits/tsocks/
![Page 319: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/319.jpg)
tor
“The onion router”
I SOCKSify but borrow someone else’s IP
I (play nice...)
![Page 322: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/322.jpg)
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
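The two strategies differ only in the order in which the per-IP budget is spent. Here is a small sketch of both as generators. The proxy names and the limit parameter are placeholders, and in real use each yielded value would select which SOCKS proxy the next request goes through.

```python
import itertools


def drain_it_dry(proxies, limit):
    """Yield each proxy `limit` times before moving to the next one."""
    for proxy in proxies:
        for _ in range(limit):
            yield proxy


def round_robin(proxies, limit):
    """Alternate between proxies, still using each at most `limit` times."""
    remaining = {proxy: limit for proxy in proxies}
    for proxy in itertools.cycle(proxies):
        if not any(remaining.values()):
            return  # every proxy has used up its budget
        if remaining[proxy] > 0:
            remaining[proxy] -= 1
            yield proxy


print(list(drain_it_dry(["a", "b"], 2)))
print(list(round_robin(["a", "b"], 2)))
```

Both stay within the same per-IP limit. Round-robin spreads each IP's requests out over time, which is one plausible reason to prefer it.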
![Page 327: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/327.jpg)
Return to JavaScript: breaking Hash Cash
![Page 328: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/328.jpg)
Detecting its presence
I Attempt to submit a comment with JS disabled
I Attempt to submit a comment with JS enabled
I Trace the second in FireBug
![Page 332: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/332.jpg)
Rewriting the JavaScript as Python
I You may think I’m joking, but this is a common strategy.
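To make the idea concrete, here is what a Python port of a proof-of-work check might look like. The real JavaScript used by any particular site is not shown in these slides, so the scheme below (SHA-256, leading zero hex digits, a "challenge:nonce" string) is entirely made up. A real port would copy whatever the page's script actually computes.

```python
import hashlib
import itertools


def solve(challenge, zeros=3):
    """Find a nonce whose SHA-256 with `challenge` starts with `zeros` '0's."""
    prefix = "0" * zeros
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge, nonce, zeros=3):
    """Server-side check: one hash, cheap compared with solving."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * zeros)


# "comment-form-42" is a hypothetical challenge string.
nonce = solve("comment-form-42")
print(verify("comment-form-42", nonce))
```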
![Page 334: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/334.jpg)
DOMForm
I Good news
“DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events.” – John J. Lee of mechanize
I Bad news
“This module is unmaintained. Maybe someday...”
Also, it does not execute page-global JavaScript, which is where HashCash is implemented.
![Page 337: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/337.jpg)
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’s spidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript and execute it?
I examples/javascript/hashcash.py
![Page 343: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/343.jpg)
Ick
I None of this is as clean and automated as mechanize.
![Page 345: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/345.jpg)
“Breaking” CAPTCHAs
![Page 346: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/346.jpg)
Fallback: yourself
I Can always just prompt the operator to figure it out and enter it
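A sketch of that fallback: save the image somewhere the operator can see it and block until they type the answer. The function name, the file name and the injectable ask parameter are assumptions made for this example. The parameter exists so the demo line can run without a human at the keyboard.

```python
def solve_captcha_by_hand(image_path, ask=input):
    """Ask the human operator to read the CAPTCHA saved at `image_path`."""
    answer = ask(f"Open {image_path} and type the text you see: ")
    return answer.strip()


# Real use: solve_captcha_by_hand("captcha.png")
# Demo with a canned reply standing in for the operator:
print(solve_captcha_by_hand("captcha.png", ask=lambda prompt: " xK7q "))
```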
![Page 348: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/348.jpg)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
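The "indexed by image contents" variant can be sketched as a dictionary keyed by a hash of the image bytes. The class name and sample data below are invented. Because the key is an exact hash, a re-encoded copy of the same picture will miss, which is the gap the fuzzy variant would have to close.

```python
import hashlib


class CaptchaTable:
    """Map exact image bytes to a previously recorded answer."""

    def __init__(self):
        self._answers = {}

    @staticmethod
    def _key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def learn(self, image_bytes, answer):
        self._answers[self._key(image_bytes)] = answer

    def lookup(self, image_bytes):
        """Return the known answer, or None if this image is new."""
        return self._answers.get(self._key(image_bytes))


table = CaptchaTable()
table.learn(b"fake image bytes", "purple monkey")  # placeholder data
print(table.lookup(b"fake image bytes"))
print(table.lookup(b"fake image bytes, re-encoded"))
```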
![Page 354: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/354.jpg)
Audio captchas: “Simple” signal analysis
I Should be doable in pylab/matplotlib with fast Fourier transforms
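As a taste of the signal-analysis step, the sketch below finds the strongest frequency in a block of samples. It swaps in a naive O(n²) discrete Fourier transform from the standard library so that it runs without numpy. With numpy installed, numpy.fft.rfft would be the usual choice. The test signal is a synthetic tone, not real CAPTCHA audio, and a real solver would need far more than this single step.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest bin of a naive DFT."""
    n = len(samples)
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        if abs(total) > best_power:
            best_bin, best_power = k, abs(total)
    return best_bin * sample_rate / n


# Synthetic 440 Hz test tone, 800 samples at 8000 Hz (10 Hz per bin).
RATE = 8000
tone = [math.sin(2 * math.pi * 440 * t / RATE) for t in range(800)]
print(dominant_frequency(tone, RATE))
```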
![Page 356: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/356.jpg)
JavaScript CAPTCHAs (like reCAPTCHA)
I re-implement CAPTCHA-downloading logic in Python
I ...or execute the JavaScript with spidermonkey
![Page 359: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/359.jpg)
...JDownloader
I “Again, our captcha team did a great job and implemented many new captcha methods.”
![Page 361: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/361.jpg)
The website from Hell: US PTO Public PAIR
http://portal.uspto.gov/external/portal/pair
![Page 362: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/362.jpg)
Start with a CAPTCHA
![Page 363: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/363.jpg)
Solve it and move on to...
I document.write()
![Page 365: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/365.jpg)
The page is invisible.
![Page 367: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/367.jpg)
Automating the web browser
![Page 368: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/368.jpg)
Selenium Remote Control
examples/seleniumrc/start.py
![Page 369: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/369.jpg)
Selenium IDE
I Our friend, XPath
I FireBug
![Page 372: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/372.jpg)
Why don’t we just do this all the time?
I Firefox memory footprint
I Flexibility
![Page 376: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/376.jpg)
Other tricks
![Page 377: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/377.jpg)
Your parser may fail
![Page 378: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/378.jpg)
Text encoding
- Look in the HTTP header!
- Try UTF-8!
- ...chardet, if you must
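
The three steps above could be chained roughly like this. A sketch only: the helper names `charset_from_content_type` and `decode_body` are hypothetical, `chardet` is treated as optional, and the final Latin-1 fallback is an added assumption so the function always returns text.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type=None):
    """Decode raw response bytes, trying the most reliable hints first."""
    # 1. Look in the HTTP header.
    charset = charset_from_content_type(content_type) if content_type else None
    if charset:
        try:
            return body.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # the header lied, or named a codec Python does not know

    # 2. Try UTF-8.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # 3. chardet, if you must (and only if it is installed).
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(body)["encoding"]
        if guess:
            try:
                return body.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass

    # Assumed last resort: Latin-1 maps every byte, so it cannot fail.
    return body.decode("latin-1")


print(decode_body("caf\u00e9".encode("latin-1"), "text/html; charset=ISO-8859-1"))
```
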
![Page 382: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/382.jpg)
Automatically reverse-engineer templates
- templatemaker by Adrian Holovaty
- everyblock templatemaker
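
To give a feel for the idea, here is a toy sketch. It is not templatemaker's code or API: it uses the standard library's `difflib` to compare two sample pages, keeps what they share as the template, and marks what differs as holes. `learn_template`, `extract`, and the sample pages are all made up for illustration, and character-level diffing like this fragments badly when the variable parts happen to share characters.

```python
import re
from difflib import SequenceMatcher

HOLE = "\x00"  # marker for a variable region; assumed never to occur in a page


def learn_template(a, b):
    """Keep the text two pages share, with HOLE wherever they differ."""
    parts = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(a[i1:i2])
        elif not parts or parts[-1] != HOLE:
            parts.append(HOLE)  # collapse any differing run into one hole
    return "".join(parts)


def extract(template, page):
    """Return the text filling each hole in page, or None if it does not fit."""
    pattern = "(.*?)".join(re.escape(chunk) for chunk in template.split(HOLE))
    match = re.fullmatch(pattern, page, re.DOTALL)
    return list(match.groups()) if match else None


# Two hypothetical pages rendered from the same server-side template.
page_a = "<b>Name:</b> Alice <i>Age:</i> 30"
page_b = "<b>Name:</b> Bob <i>Age:</i> 41"

template = learn_template(page_a, page_b)
print(extract(template, "<b>Name:</b> Carol <i>Age:</i> 27"))  # ['Carol', '27']
```
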
![Page 385: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/385.jpg)
table2dict
- Python bug tracker
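
One plausible shape for a table-to-dict helper, sketched with the standard library's `html.parser`. This is not the talk's own table2dict: the names `TableParser` and `table_to_dicts` are hypothetical, the sample rows are invented, and the first row is assumed to hold the column headers.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dicts(html):
    """Turn a table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample in the spirit of a bug tracker listing.
SAMPLE = """
<table>
  <tr><th>ID</th><th>Title</th><th>Status</th></tr>
  <tr><td>1001</td><td>Crash on import</td><td>open</td></tr>
  <tr><td>1002</td><td>Typo in docs</td><td>closed</td></tr>
</table>
"""

for bug in table_to_dicts(SAMPLE):
    print(bug["ID"], bug["Status"])
```
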
![Page 388: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/388.jpg)
Conclusions
![Page 389: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/389.jpg)
Scaling and stability
- Choosing reliable queries from web pages
- Expanding to more IP addresses when necessary using SSH (and Python 2.6 multiprocessing for a plausible model of how to rotate SOCKS proxies)
- Tor (and other proxy considerations)
- registrar.py: was seven years stable...
![Page 394: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/394.jpg)
Summary
- If it’s on a web page, you can scrape it out.
- “Now you have an API for everything.”
![Page 397: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/397.jpg)
Future directions
- More automation
- Using cssselect everywhere, geez it’s cool
![Page 400: Scrape the Web: Strategies for programming websites that don’t](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb48352e268c58cd5c52a0/html5/thumbnails/400.jpg)
Bonus time
If we have time:
- Greasemonkey demo: scraping in the browser
- Audience-suggested scraping lab
- Workshopping on queries or regular expressions