Practical Web Scraping
with Web::Scraper
Tatsuhiko Miyagawa
Six Apart, Ltd. / Shibuya Perl Mongers
YAPC::Europe 2007 Vienna
Tatsuhiko Miyagawa 2007/08/28 YAPC::Europe 2007
Tatsuhiko Miyagawa
CPAN: MIYAGAWA
abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords
Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny
Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton
Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC
Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager
Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era
Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode
File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent
HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote
Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp
Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find
PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable
Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript
Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon
Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape
WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery
WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog
XML::Atom::Stream XML::Liberal
http://code.sixapart.com/
Practical Web Scraping
with Web::Scraper
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping
"Screen-scraping is so 1999!"
RSS is metadata, not a complete HTML replacement
Practical Web Scraping
with Web::Scraper
What's wrong with LWP & Regexp?
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
It works …
There are 3 problems (at least)
(1) Fragile
Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
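A minimal sketch of that fragility, using only core Perl. The pattern is the one from the one-liner earlier in the talk; the markup variants are hypothetical, made up for this illustration, and the time value is shortened for readability.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the one-liner, tied to one exact spelling of the tag.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical variants of the same markup (not taken from the real site).
# A browser treats all four the same way.
my %html = (
    original      => '<strong id="ctu">12:49:46</strong>',
    single_quotes => q{<strong id='ctu'>12:49:46</strong>},
    extra_attr    => '<strong class="t" id="ctu">12:49:46</strong>',
    newline       => qq{<strong id="ctu">12:49:46\n</strong>},
);

# Report which variants the regular expression still recognises.
for my $name (sort keys %html) {
    my $status = $html{$name} =~ $re ? "matched" : "BROKEN";
    print "$name: $status\n";
}
```

Only the "original" variant matches. Changed quoting, an added attribute, or a newline inside the element (the pattern has no /s modifier) each defeat the expression.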
(2) Hard to maintain
Regular expression based scrapers are good only when they're used in write-only scripts
(3) Improper HTML & encoding handling
<span class="message">I &hearts; Vienna</span>
> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I &hearts; Vienna
<span class="message">I &hearts; Vienna</span>
> perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
I ♥ Vienna
<span class="message"> ウィーンが大好き! </span>
> perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き!
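The "Wide character" warning above appears because decoded characters are printed to a filehandle that has no encoding layer. A minimal sketch of the usual decode-on-input, encode-on-output pattern, using only the core Encode module. It assumes the input bytes are UTF-8 and leaves out entity decoding to stay dependency-free.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # this source file itself is UTF-8
use Encode qw(decode encode);

# Simulate the raw bytes an HTTP client would hand over (assumed UTF-8).
my $bytes = encode('UTF-8', '<span class="message">ウィーンが大好き!</span>');

# 1. Decode bytes into Perl characters at the input boundary.
my $html = decode('UTF-8', $bytes);

# 2. Work on characters, not bytes.
my ($message) = $html =~ m{<span class="message">(.*?)</span>};

# 3. Encode again at the output boundary; the layer on STDOUT is
#    what prevents the "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print length($message), " characters: $message\n";
```

The regular expression is kept here only to isolate the encoding issue. It is still as fragile as before.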
The "right" way of screen-scraping
(1), (2)
Maintainable
Less fragile
Use XPath and CSS Selectors
XPath
HTML::TreeBuilder::XPath
XML::LibXML
XPath
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
CSS Selectors
"XPath for HTML coders"
"XPath for people who hate XML"
CSS Selectors
body { font-size: 12px; }
div.article { padding: 1em }
span#count { color: #fff }
XPath: //strong[@id="ctu"]
CSS Selector: strong#ctu
CSS Selectors
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;
# Monday, August 27, 2007 at 12:49:46
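To illustrate why a selector is less fragile than a pattern match, here is a small offline sketch that runs the same CSS selector against several spellings of the element. The markup variants are hypothetical and the time value is shortened. It assumes HTML::TreeBuilder::XPath and HTML::Selector::XPath are installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical variants of the same element: different quoting,
# an extra attribute, and a newline inside the tag.
my @variants = (
    '<p><strong id="ctu">12:49:46</strong></p>',
    q{<p><strong id='ctu'>12:49:46</strong></p>},
    '<p><strong class="t" id="ctu">12:49:46</strong></p>',
    qq{<p><strong\n id="ctu">12:49:46</strong></p>},
);

# Translate the CSS selector once; the XPath is reused for every document.
my $xpath = selector_to_xpath('strong#ctu');

my @found;
for my $html (@variants) {
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my ($node) = $tree->findnodes($xpath);   # list context: first match
    push @found, $node ? $node->as_text : '(not found)';
    $tree->delete;                           # free the parse tree
}
print "$_\n" for @found;
```

Because the document is parsed into a tree first, differences in quoting, attribute order, or whitespace inside the tag no longer matter to the query.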
Complete Script
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
Robust,
Maintainable,
and sane character handling
Example (before)
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
Example (after)
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
but … long and boring
Practical Web Scraping
with Web::Scraper
Web scraping toolkit
inspired by scrapi.rb
DSL-ish
Example (before)
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}
my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;
Example (after)
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;
my $s = scraper {
process "strong#ctu", time => 'TEXT';
result 'time';
};
my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);
Basics
use Web::Scraper;
my $s = scraper {
# DSL goes here
};
my $res = $s->scrape($uri);
process
process $selector,
$key => $what,
…;
$selector:
CSS Selector or
XPath (starts with /)
$key:
key for the result hash
append "[]" for looping
$what:
'@attr'
'TEXT'
Web::Scraper
sub { … }
Hash reference
<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>
process "ul.sites > li > a",
'urls[]' => '@href';
# { urls => [ … ] }
process '//ul[@class="sites"]/li/a',
'names[]' => 'TEXT';
# { names => [ 'OpenGuides', … ] }
process "ul.sites > li",
'sites[]' => scraper {
process 'a',
link => '@href', name => 'TEXT';
};
# { sites => [ { link => …, name => … },
# { link => …, name => … } ] };
process "ul.sites > li > a",
'sites[]' => sub {
# $_ is HTML::Element
+{ link => $_->attr('href'), name => $_->as_text };
};
# { sites => [ { link => …, name => … },
# { link => …, name => … } ] };
process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };
# { sites => [ { link => …, name => … },
# { link => …, name => … } ] };
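The forms of $what shown above can be tried offline, because scrape also accepts a string of HTML in place of a URI. A minimal sketch, assuming Web::Scraper is installed; the markup is the sites list from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# The sample list from the slides, kept inline so nothing is fetched.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # One selector, several key/what pairs; "[]" collects every match.
    process 'ul.sites > li > a',
        'urls[]'  => '@href',                            # attribute value
        'names[]' => 'TEXT',                             # text content
        'sites[]' => { link => '@href', name => 'TEXT' }; # hash per node
};

my $res = $s->scrape($html);   # a plain string is parsed as HTML
print Dumper $res;
```

When the input is a string, no base URI is known, so the href values come back as they are written in the markup.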
result
result; # get stash as hashref (default)
result @keys; # get stash as hashref containing @keys
result $key; # get value of stash $key;
my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
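A small sketch of the three forms side by side, run against a hypothetical inline page (the markup and key names are made up for illustration). It assumes Web::Scraper is installed and that result is the last statement of the block, as in the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# Hypothetical page used only for this sketch.
my $html = '<h1>Vienna</h1>'
         . '<a href="http://example.org/">one</a>'
         . '<a href="http://example.net/">two</a>';

# No result call: scrape returns the whole stash as a hashref.
my $all_s = scraper {
    process 'h1', title     => 'TEXT';
    process 'a',  'links[]' => '@href';
};

# One key: scrape returns that value directly.
my $one_s = scraper {
    process 'h1', title     => 'TEXT';
    process 'a',  'links[]' => '@href';
    result 'title';
};

# Several keys: scrape returns a hashref holding only those keys.
my $some_s = scraper {
    process 'h1', title     => 'TEXT';
    process 'a',  'links[]' => '@href';
    process 'a',  first     => 'TEXT';
    result 'title', 'first';
};

my $all  = $all_s->scrape($html);
my $one  = $one_s->scrape($html);
my $some = $some_s->scrape($html);
print Dumper($all, $one, $some);
```

The single-key form is what lets the earlier world clock example print the scraped time directly.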
More Examples
Thumbnail URLs on Flickr set
#!/usr/bin/perl
use strict;
use Data::Dumper;
use Web::Scraper;
use URI;
my $url = "http://flickr.com/photos/bulknews/sets/72157601700510359/";
my $s = scraper {
process "a.image_link img", "thumbs[]" => '@src';
};
warn Dumper $s->scrape( URI->new($url) );
<span class="vcard"> <a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson"> <img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a></span>
<span class="vcard">…</span>
Twitter Friends
#!/usr/bin/perl
use strict;
use Web::Scraper;
use URI;
use Data::Dumper;
my $url = "http://twitter.com/miyagawa";
my $s = scraper {
process "span.vcard a", "people[]" => '@title';
};
warn Dumper $s->scrape( URI->new($url) ) ;
Twitter Friends (complex)
#!/usr/bin/perl
use strict;
use Web::Scraper;
use URI;
use Data::Dumper;
my $url = "http://twitter.com/miyagawa";
my $s = scraper {
process "span.vcard", "people[]" => scraper {
process "a", link => '@href', name => '@title';
process "img", thumb => '@src';
};
};
warn Dumper $s->scrape( URI->new($url) ) ;
Tools
> cpan Web::Scraper
comes with 'scraper' CLI
> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
links => [
'http://example.org/',
'http://example.net/',
],
};
scraper> y
---
links:
- http://example.org/
- http://example.net/
> scraper /path/to/foo.html
> GET http://example.com/ | scraper
TODO
Web::Scraper needs documentation
More examples to put in eg/ directory
integrate with WWW::Mechanize
and Test::WWW::Declare
XPath Auto-suggestion
off of DOM + element
DOM + XPath => Element
DOM + Element => XPath?
(Template::Extract?)
Questions?
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper