![Page 1: Web Crawlers - University of Washingtoncourses.washington.edu/ir2010/crawlers.pdf · 2010-11-16 · Introducon*to*Informa)on*Retrieval! ! FromChristopher*Manning*and*Prabhakar*Raghavan](https://reader033.vdocument.in/reader033/viewer/2022042018/5e7605ea23f8f702aa4a7d7b/html5/thumbnails/1.jpg)
Web Crawlers
Oct 28, 2010
Wednesday, November 3, 2010
What’s a website?
Introduction to Information Retrieval
From Christopher Manning and Prabhakar Raghavan
Basic crawler operation
• Begin with known “seed” URLs
• Fetch and parse them
  • Extract URLs they point to
  • Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
Sec. 20.2
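The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler: the names `LinkExtractor`, `fetch`, and `crawl`, the `max_pages` cap, and the 10-second timeout are all assumptions introduced here, and the sketch ignores politeness and robots.txt, which later slides cover.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (hypothetical helper; no retries, no politeness)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # every URL ever placed on the queue
    crawled = []           # URLs fetched and parsed, in order
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue       # skip pages that fail to download
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # expand relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled
```

Passing `fetch_page` as a parameter lets the loop be exercised against an in-memory dictionary of pages instead of the live web.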
Crawling picture
[Figure: the Web, showing the seed pages, the URLs crawled and parsed, the URL frontier, and the unseen Web.]
Sec. 20.2
How do we determine the seed URLs?
Simple picture – complications
• Web crawling isn’t feasible with one machine
  • All of the above steps must be distributed
• Malicious pages
  • Spam pages
  • Spider traps – including dynamically generated ones
• Even non-malicious pages pose challenges
  • Latency and bandwidth to remote servers vary
  • Webmasters’ stipulations
    • How “deep” should you crawl a site’s URL hierarchy?
  • Site mirrors and duplicate pages
• Politeness – don’t hit a server too often
Sec. 20.1.1
What any crawler must do
• Be Polite: Respect implicit and explicit politeness considerations
  • Only crawl allowed pages
  • Respect robots.txt (more on this shortly)
• Be Robust: Be immune to spider traps and other malicious behavior from web servers
Sec. 20.1.1
What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
Sec. 20.1.1
What any crawler should do
• Fetch pages of “higher quality” first
• Continuous operation: Continue fetching fresh copies of a previously fetched page
• Extensible: Adapt to new data formats, protocols
Sec. 20.1.1
Updated crawling picture
[Figure: the seed pages, the URLs crawled and parsed, and the unseen Web, with crawling threads drawing from the URL frontier.]
Sec. 20.1.1
URL frontier
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy
Sec. 20.2
Explicit and implicit politeness
• Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  • robots.txt
• Implicit politeness: even with no specification, avoid hitting any site too often
Sec. 20.2
Robots.txt
• Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
  • www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
  • For a URL, create a file URL/robots.txt
  • This file specifies access restrictions
Sec. 20.2.1
Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
Sec. 20.2.1
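To see how a crawler might evaluate these rules, here is a small sketch using Python's standard `urllib.robotparser` module. The host `example.com`, the agent name `otherbot`, and the helper `build_parser` are made up for illustration; only the robots.txt text comes from the slide.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, as they would appear in the robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
temp_page = "http://example.com/yoursite/temp/page.html"  # hypothetical URL

blocked = parser.can_fetch("otherbot", temp_page)       # generic robot: disallowed
allowed = parser.can_fetch("searchengine", temp_page)   # named robot: allowed
```

An empty `Disallow:` line means nothing is disallowed, which is why the `searchengine` entry overrides the general rule for that robot.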
Processing steps in crawling
• Pick a URL from the frontier (Which one?)
• Fetch the document at the URL
• Parse the fetched document
  • Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  • If not, add to indexes
• For each extracted URL
  • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  • Check if it is already in the frontier (duplicate URL elimination)
Sec. 20.2.1
Basic crawl architecture
Sec. 20.2.1
DNS (Domain Name System)
• A lookup service on the internet
  • Given a URL’s host name, retrieve its IP address
  • Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
• Solutions
  • DNS caching
  • Batch DNS resolver – collects requests and sends them out together
Sec. 20.2.2
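The DNS caching idea can be sketched as a thin wrapper around the blocking lookup. The class name `CachingResolver` and the fixed 300-second lifetime are assumptions for illustration; real DNS records carry their own time-to-live, which `socket.gethostbyname` does not expose.

```python
import socket
import time


class CachingResolver:
    """Remembers host-to-IP answers for ttl_seconds to avoid repeated lookups."""

    def __init__(self, lookup=socket.gethostbyname, ttl_seconds=300.0,
                 clock=time.monotonic):
        self._lookup = lookup    # blocking lookup function: host -> IP string
        self._ttl = ttl_seconds  # assumed fixed lifetime for every cached answer
        self._clock = clock
        self._cache = {}         # host -> (ip_address, expiry_time)

    def resolve(self, host):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]                  # cache hit: no network round trip
        ip_address = self._lookup(host)      # cache miss: blocking lookup
        self._cache[host] = (ip_address, now + self._ttl)
        return ip_address
```

Because a crawler fetches many pages from the same host, most lookups after the first become dictionary reads. The batch resolver from the slide would go further by issuing many lookups concurrently, which this sketch does not attempt.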
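DNS: dig +trace www.djp3.net
[Figure: resolving www.djp3.net in four numbered steps.]
1. Ask the root name server ({A}.ROOT-SERVERS.NET = 198.41.0.4): “Where is www.djp3.net?” Answer: “Ask 192.5.6.30” ({A}.GTLD-SERVERS.net = 192.5.6.30).
2. Ask the .net name server: “Where is www.djp3.net?” Answer: “Ask 72.1.140.145” ({ns1}.speakeasy.net = 72.1.140.145).
3. Ask the djp3.net name server: “Where is www.djp3.net?” Answer: “Use 69.17.116.124” (www.djp3.net = 69.17.116.124).
4. Send “Give me a web page” to 69.17.116.124.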
Question
• How do I know if I’ve seen this before?
• Am I stuck in a loop?
Parsing: URL normalization
• When a fetched document is parsed, some of the extracted links are relative URLs
• E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, must normalize (expand) such relative URLs
Sec. 20.2.1
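In Python, the expansion described on this slide is what `urllib.parse.urljoin` does. The sketch below reproduces the Wikipedia example; the helper name `normalize` and the extra step of dropping the `#fragment` are assumptions added here, not something the slide prescribes.

```python
from urllib.parse import urldefrag, urljoin


def normalize(base_url, link):
    """Expand a possibly relative link against the page it was found on."""
    absolute = urljoin(base_url, link)
    # A "#section" suffix points inside the same document, so drop it.
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment


page = "http://en.wikipedia.org/wiki/Main_Page"
print(normalize(page, "/wiki/Wikipedia:General_disclaimer"))
# prints http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```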
Content seen?
• Duplication is widespread on the web
• If the page just fetched is already in the index, do not further process it
• This is verified using document fingerprints or shingles
  • A type of hashing scheme
Sec. 20.2.1
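A rough sketch of both ideas follows: a fingerprint catches exact duplicates, while shingles give a similarity score for near duplicates. The choice of SHA-256, word-level shingles of length 3, and the function names are illustrative assumptions, not the scheme any particular search engine uses.

```python
import hashlib


def fingerprint(text):
    """Exact-duplicate check: hash of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of all runs of k consecutive words, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(doc_a), shingles(doc_b))  # 6 shared of 8 total = 0.75
```

A crawler would store the fingerprints of indexed pages in a set and treat a page as already seen when its fingerprint is present, or when its shingle overlap with an indexed page exceeds some chosen threshold.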
Filters and robots.txt
• Filters – regular expressions for URLs to be crawled/not
• Once a robots.txt file is fetched from a site, need not fetch it repeatedly
  • Doing so burns bandwidth, hits web server
• Cache robots.txt files
Sec. 20.2.1
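As one possible shape for such a filter, the sketch below accepts only `.edu` hosts (the example restriction mentioned earlier in the deck) and skips a few binary file types. Both patterns and the name `passes_filters` are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules: crawl only .edu hosts, and skip some non-HTML file types.
ALLOWED_HOST = re.compile(r"(^|\.)[a-z0-9-]+\.edu$")
BLOCKED_PATH = re.compile(r"\.(jpg|png|gif|zip|pdf)$", re.IGNORECASE)


def passes_filters(url):
    """Return True if the URL should be added to the frontier."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False                        # e.g. mailto: or javascript: links
    host = (parts.hostname or "").lower()
    if not ALLOWED_HOST.search(host):
        return False
    return BLOCKED_PATH.search(parts.path) is None
```

Matching the host name separately from the path keeps a URL such as `http://evil.com/fake.edu` from slipping through a pattern applied to the whole string.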
Duplicate elimination
• One-time crawl:
  • Test to see if an extracted, parsed, filtered URL
    • has already been sent to the frontier
    • has already been indexed
Duplicate elimination
• Continuous crawl:
  • Update the URL’s priority
    • staleness
    • quality
    • politeness
Distributing the crawler
• Run multiple crawl threads, under different processes – potentially at different nodes
  • Geographically distributed nodes
• Partition hosts being crawled into nodes
  • Hash used for partition
• How do these nodes communicate?
Sec. 20.2.1
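One way to realize the hash partition is to hash each URL's host name and take the result modulo the number of nodes, so that every URL on a host is handled by the same node. The function name `node_for_url` and the use of SHA-256 are assumptions; Python's built-in `hash()` is avoided because its string hashes differ between processes.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL's host to a node index in range(num_nodes)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; stable across machines and runs.
    return int.from_bytes(digest[:8], "big") % num_nodes
```

With this scheme, a node that extracts a link belonging to another node's partition has to forward that URL to its owner, which is the communication question the slide raises.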
Communication between nodes
Sec. 20.2.1
URL frontier: two main considerations
• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others
  • E.g., pages (such as news sites) whose content changes often
These goals may conflict with each other. (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)
Sec. 20.2.3
Politeness – challenges
• Even if we restrict fetching from a host to a single thread, that thread can still hit the host repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host
Sec. 20.2.3
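The heuristic can be sketched as a per-host record of when the next request is allowed. The class `HostThrottle` and the multiplier of 10 are assumptions for illustration; the slide only says the gap should be much larger than the last fetch time.

```python
import time


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, gap_factor=10.0, clock=time.monotonic):
        self._gap_factor = gap_factor  # assumed multiplier, not from the slides
        self._clock = clock
        self._next_allowed = {}        # host -> earliest next request time

    def wait_time(self, host):
        """Seconds to wait before contacting host again (0.0 if ready now)."""
        return max(0.0, self._next_allowed.get(host, 0.0) - self._clock())

    def record_fetch(self, host, fetch_seconds):
        """Call after a fetch completes, passing how long the fetch took."""
        gap = self._gap_factor * fetch_seconds
        self._next_allowed[host] = self._clock() + gap
```

A crawling thread would check `wait_time(host)` before each request and either sleep for that long or pick a URL from a different host, which helps keep threads busy while slow servers get a proportionally longer rest.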
Exercise...
Crawl a site
Ethics
What should I crawl?
• robots.txt
• Facebook pages?
• Change in Service
• Terms of Service (TOS?)