Keith Enlow
Heritrix Mobile
Introduction
Heritrix 3.1
Mobile Finder Web Service
2 Options
  Crawl desktop web pages (default)
  Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries.
Experiment
Decision Making (diagram): Heritrix, Web Service (Mobile Finder), Heritrix
Modified Heritrix 3.1 to include two options for crawling
  Option 0: Crawl with desktop user agent
  Option 1: Crawl with mobile user agent using Mobile Finder
Added a built-in mobile user agent adapted from Googlebot
Crawled a small set of URLs
Used Mobile Finder to determine whether a given URL has a mobile version
Wrote a small script to discover differences between the mobile and desktop versions
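The comparison script itself is not shown in the slides. As a rough illustration only, the per-type counts reported later (HTML, JavaScript, CSS, images) could be produced by something like the following Python sketch. The function name count_by_type, the extension table, and the idea of classifying by file extension are all assumptions made here; the original script may well have used MIME types from the crawl logs instead.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath

# Hypothetical mapping from file extension to resource type (an assumption).
TYPE_BY_EXT = {
    ".html": "html", ".htm": "html",
    ".js": "javascript",
    ".css": "css",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".gif": "image",
}

def count_by_type(urls):
    """Count crawled URLs per resource type, based on the URL path extension."""
    counts = Counter()
    for url in urls:
        # Use only the path so query strings such as ?v=2 do not hide the extension.
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        counts[TYPE_BY_EXT.get(ext, "other")] += 1
    return counts
```

Running count_by_type once on the desktop crawl's URL list and once on the mobile crawl's list gives two Counter objects whose difference shows where the two versions diverge.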
<property name="userAgentTemplate"
  value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<property name="userAgentTemplateMobile"
  value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
<!--
  Option # = Description
  0 [Default]  Crawl using desktop user agent
  1            Crawl using mobile user agent + Mobile Finder Web Service
-->
<property name="CrawlOption" value="0" />
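To make the effect of the CrawlOption setting concrete, here is a minimal Python sketch of how the two templates could be selected and filled in. It is not the modified Heritrix code (Heritrix is written in Java); the function build_user_agent and its parameters are hypothetical, and only the template strings and placeholder names come from the configuration above.

```python
DESKTOP_TEMPLATE = (
    "Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)
MOBILE_TEMPLATE = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) "
    "AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 "
    "Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)

def build_user_agent(crawl_option, version, contact_url):
    """Pick a template by crawl option (0 = desktop, 1 = mobile) and fill it in."""
    if crawl_option not in (0, 1):
        raise ValueError("CrawlOption must be 0 (desktop) or 1 (mobile)")
    template = MOBILE_TEMPLATE if crawl_option == 1 else DESKTOP_TEMPLATE
    # Substitute the two placeholders used in the templates.
    return (template
            .replace("@VERSION@", version)
            .replace("@OPERATOR_CONTACT_URL@", contact_url))
```

For example, build_user_agent(0, "3.1.0", "http://example.org/crawl") yields the desktop string with the version and contact URL substituted; the version and URL here are placeholder values.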
URLs Crawled

Desktop URL             Mobile URL
www.huffingtonpost.com  m.huffpost.com
www.foxnews.com         foxnews.mobi
www.nbcnews.com         www.nbcnews.com
www.whitehouse.gov      m.whitehouse.gov
www.nasa.gov            mobile.nasa.gov
www.ssa.gov             www.ssa.gov/mobile
www.cornell.edu         m.cornell.edu/#home
www.stanford.edu        m.stanford.edu
www.mit.edu             m.mit.edu / mobile.mit.edu
Redirection/Delivery
200 Response (server side redirect)
302 “Temporary” relocation
301 “Permanent” relocation
JavaScript Redirection (client side redirect)
Media Queries
Style Sheets
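The delivery mechanisms listed above can be told apart, roughly, from the HTTP status code and a static look at the response body. The Python sketch below is a simplified heuristic, not the logic used by Mobile Finder or Heritrix; the function classify_delivery, the two regular expressions, and the returned labels are assumptions chosen for illustration, and real pages will produce false positives and negatives.

```python
import re

# Static text patterns; no JavaScript is executed (an intentionally crude check).
JS_REDIRECT = re.compile(
    r"(?:window\.|document\.)?location(?:\.href)?\s*=|location\.replace\s*\(",
    re.IGNORECASE,
)
MEDIA_QUERY = re.compile(r"@media\b|<link[^>]+\bmedia\s*=", re.IGNORECASE)

def classify_delivery(status, body=""):
    """Guess how a mobile page is delivered from the status code and body text."""
    if status == 301:
        return "301 permanent redirect"
    if status == 302:
        return "302 temporary redirect"
    if status == 200:
        if JS_REDIRECT.search(body):
            return "JavaScript redirect"
        if MEDIA_QUERY.search(body):
            return "media queries"
        return "200 response"
    return "other"
```

A 200 response whose body assigns to window.location is reported as a JavaScript redirect, which is the case a crawler without a JavaScript engine cannot follow on its own.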
Tiny Limits
No JavaScript Engine
Heritrix cannot execute JavaScript code.
It is unable to catch client-side redirection and will instead continue to crawl the desktop version of the web page.
Note: The Mobile Finder Web Service will find the mobile page, and therefore Heritrix will continue the crawl.
www.nasa.gov
www.ssa.gov
www.cornell.edu
Desktop vs Mobile
Total Link Count

Version  Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
Desktop  56774       12703     8894      4960  2380  8121         2351      2901     120
Mobile   2134        110       3545      63    53    570          116       94       124
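To put the total link counts on a common scale, the mobile count can be divided by the desktop count for each site. The Python sketch below does this with the figures from the slide; it assumes the first row of numbers is the desktop crawl and the second is the mobile crawl, and the function name mobile_to_desktop_ratio is introduced here for illustration.

```python
SITES = ["Huffington", "Fox News", "NBC News", "NASA", "SSA",
         "White House", "Stanford", "Cornell", "MIT"]
# Assumed row order: first row desktop, second row mobile.
DESKTOP_LINKS = [56774, 12703, 8894, 4960, 2380, 8121, 2351, 2901, 120]
MOBILE_LINKS = [2134, 110, 3545, 63, 53, 570, 116, 94, 124]

def mobile_to_desktop_ratio(sites, desktop, mobile):
    """Return {site: mobile links / desktop links} for each crawled site."""
    return {site: m / d for site, d, m in zip(sites, desktop, mobile)}

if __name__ == "__main__":
    ratios = mobile_to_desktop_ratio(SITES, DESKTOP_LINKS, MOBILE_LINKS)
    for site, ratio in ratios.items():
        print(f"{site:12s} {ratio:.3f}")
```

Under that row assumption, MIT is the only site where the mobile crawl found more links than the desktop crawl (124 against 120).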
HTML Distribution

Version  Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
Desktop  11550       2681      2302      851   20    3251         385       596      12
Mobile   493         35        488       18    0     76           16        31       26
JavaScript Distribution

Version  Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
Desktop  245         107       46        589   12    83           104       525      2
Mobile   33          4         14        8     0     13           4         8        0
CSS Distribution

Version  Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
Desktop  587         301       72        304   1     154          214       86       3
Mobile   36          3         17        1     0     19           8         4        3
Image Distribution

Version  Huffington  Fox News  NBC News  NASA  SSA   White House  Stanford  Cornell  MIT
Desktop  38671       8893      5852      2908  17    4187         1460      1484     87
Mobile   1227        59        2769      28    0     436          74        4        89
FIN