![Page 1: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/1.jpg)
Parallel Crawlers
AND
Efficient URL Caching for World Wide Web Crawling
Presenter: Sawood Alam (salam@cs.odu.edu)
![Page 2: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/2.jpg)
Parallel Crawlers
Hector Garcia-Molina, Stanford University
Junghoo Cho, University of California
![Page 3: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/3.jpg)
ABSTRACT
- Design an effective and scalable parallel crawler
- Propose multiple architectures for a parallel crawler
- Identify fundamental issues related to parallel crawling
- Metrics to evaluate a parallel crawler
- Compare the proposed architectures using 40 million pages
![Page 4: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/4.jpg)
Challenges for parallel crawlers
- Overlap
- Quality
- Communication bandwidth
![Page 5: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/5.jpg)
Advantages
- Scalability
- Network-load dispersion
- Network-load reduction
  - Compression
  - Difference
  - Summarization
![Page 6: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/6.jpg)
Related work
- General architecture
- Page selection
- Page update
![Page 7: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/7.jpg)
Geographical categorization
- Intra-site parallel crawler
- Distributed crawler
![Page 8: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/8.jpg)
Communication
- Independent
- Dynamic assignment
- Static assignment
![Page 9: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/9.jpg)
Crawling modes (Static)
- Firewall mode
- Cross-over mode
- Exchange mode
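The three static-assignment modes differ in what a crawling process does with a link that points outside its own partition. A minimal sketch, assuming the usual reading of the modes (firewall drops such links, cross-over follows them only once its own work runs out, exchange forwards them to the owning process). The names `route_link`, `next_url`, `owner`, `foreign`, and `outbox` are hypothetical and not taken from the slides.

```python
def route_link(url, my_id, owner, mode, frontier, foreign, outbox):
    """Decide what one crawling process does with a newly discovered link.

    owner    -- hypothetical function mapping a URL to the id of the process
                responsible for it (the partitioning function)
    frontier -- URLs in this process's own partition, waiting to be fetched
    foreign  -- inter-partition URLs kept as a fallback (cross-over mode)
    outbox   -- dict: process id -> URLs to send to that process (exchange mode)
    """
    target = owner(url)
    if target == my_id:
        frontier.append(url)          # intra-partition link: always followed
    elif mode == "firewall":
        pass                          # inter-partition link is thrown away
    elif mode == "cross-over":
        foreign.append(url)           # only used once the own partition is exhausted
    elif mode == "exchange":
        outbox.setdefault(target, []).append(url)  # hand over to the owner
    else:
        raise ValueError(f"unknown mode: {mode}")


def next_url(frontier, foreign):
    """Pick the next URL to fetch: own partition first, foreign links last."""
    if frontier:
        return frontier.pop(0)
    if foreign:
        return foreign.pop(0)
    return None
```

Under these assumptions, firewall mode can miss pages (lower coverage), cross-over mode can fetch pages another process also fetches (overlap), and exchange mode pays for its messages (communication overhead).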
![Page 10: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/10.jpg)
URL exchange minimization
- Batch communication
- Replication
![Page 11: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/11.jpg)
Partitioning function
- URL-hash based
- Site-hash based
- Hierarchical
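The three partitioning schemes can be sketched as small functions that map a URL to a process id. This is only an illustration: the function names, the use of CRC32 as the hash, and the top-level-domain table for the hierarchical scheme are assumptions, not details given on the slide.

```python
import zlib
from urllib.parse import urlsplit


def url_hash_partition(url, num_procs):
    """URL-hash based: hash the whole URL, so pages of one site get scattered."""
    return zlib.crc32(url.encode("utf-8")) % num_procs


def site_hash_partition(url, num_procs):
    """Site-hash based: hash only the host name, so a site stays on one process."""
    host = (urlsplit(url).hostname or "").lower()
    return zlib.crc32(host.encode("utf-8")) % num_procs


def hierarchical_partition(url, tld_to_proc, default_proc=0):
    """Hierarchical: assign by position in the domain hierarchy.

    Here the (assumed) rule is a lookup on the top-level domain,
    e.g. {"com": 0, "net": 1, "org": 2}.
    """
    host = (urlsplit(url).hostname or "").lower()
    tld = host.rsplit(".", 1)[-1]
    return tld_to_proc.get(tld, default_proc)
```

Because most links stay within a site, a site-hash scheme tends to produce far fewer inter-partition links than a URL-hash scheme, which is the reason it matters for URL exchange.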
![Page 12: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/12.jpg)
Evaluation models
- Overlap
- Coverage
- Quality
- Communication overhead
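The slide gives only the metric names. A minimal sketch, assuming the definitions commonly attributed to the Cho and Garcia-Molina paper: overlap as (N - I) / I, coverage as I / U, communication overhead as M / P, and quality as the fraction of the most important pages that were actually downloaded. The function and parameter names are hypothetical.

```python
def overlap(total_downloaded, unique_downloaded):
    """(N - I) / I: extra downloads caused by fetching the same page twice."""
    return (total_downloaded - unique_downloaded) / unique_downloaded


def coverage(unique_downloaded, total_to_download):
    """I / U: share of the pages the crawler was supposed to fetch."""
    return unique_downloaded / total_to_download


def communication_overhead(urls_exchanged, pages_downloaded):
    """M / P: inter-process URLs exchanged per downloaded page."""
    return urls_exchanged / pages_downloaded


def quality(downloaded, most_important):
    """Fraction of the most important pages found among the downloaded ones.

    Both arguments are collections of URLs; how "importance" is scored
    (for example, by backlink count) is left outside this sketch.
    """
    most_important = set(most_important)
    return len(set(downloaded) & most_important) / len(most_important)
```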
![Page 13: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/13.jpg)
Firewall mode and coverage
![Page 14: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/14.jpg)
Cross-over mode and overlap
![Page 15: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/15.jpg)
Exchange mode and communication
![Page 16: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/16.jpg)
Quality and batch communication
![Page 17: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/17.jpg)
Conclusion
- Firewall mode works well when there are 4 or fewer crawling processes
- URL exchange adds less than 1% network overhead
- Quality is maintained even with batch communication
- Replicating the 10,000 to 100,000 most popular URLs can reduce communication overhead by 40%
![Page 18: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/18.jpg)
Efficient URL Caching for World Wide Web Crawling
Andrei Z. Broder, IBM TJ Watson Research Center
Janet L. Wiener, Hewlett Packard
Marc Najork, Microsoft Research
![Page 19: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/19.jpg)
Introduction
- Fetch a page
- Parse it to extract all linked URLs
- For all the URLs not seen before, repeat the process
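The three steps above can be sketched as a short loop. This is an illustration only: `fetch` is a hypothetical callable that returns the HTML for a URL, the regular expression is a deliberately naive link extractor, and breadth-first order is an assumption.

```python
import re
from collections import deque

# Naive link extractor for the sketch; a real crawler would use an HTML parser.
HREF = re.compile(r'href="([^"]+)"')


def crawl(seed, fetch, max_pages=100):
    """Fetch a page, extract its links, and repeat for every URL not seen before."""
    seen = {seed}              # the "URLs seen" set that the caching work targets
    frontier = deque([seed])   # URLs discovered but not yet fetched
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        visited.append(url)
        for link in HREF.findall(html):
            if link not in seen:   # membership test performed once per link
                seen.add(link)
                frontier.append(link)
    return visited
```

The `link not in seen` test runs once for every link on every page, which is why the cost of that lookup, and whether it can be answered from an in-memory cache, dominates at web scale.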
![Page 20: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/20.jpg)
Challenges
- The web is very large (coverage)
  - doubling every 9-12 months
- Web pages are changing rapidly (freshness)
  - all changes (40% weekly)
  - changes by a third or more (7% weekly)
![Page 21: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/21.jpg)
Crawlers
- IA crawler
- Original Google crawler
- Mercator web crawler
- Cho and Garcia-Molina’s crawler
- WebFountain
- UbiCrawler
- Shkapenyuk and Suel’s crawler
![Page 22: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/22.jpg)
Caching
- Analogous to OS cache
- Non-uniformity of requests
- Temporal correlation or locality of reference
![Page 23: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/23.jpg)
Caching algorithms
- Infinite cache (INFINITE)
- Clairvoyant caching (MIN)
- Least recently used (LRU)
- CLOCK
- Random replacement (RANDOM)
- Static caching (STATIC)
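Two of the practical policies listed above, LRU and CLOCK, can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the talk: the class names, the `access` method, and the choice to insert new CLOCK entries with their mark bit cleared are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the URL whose last request is oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # ordered from least to most recently used

    def access(self, url):
        """Return True on a hit; on a miss, insert the URL and return False."""
        if url in self.items:
            self.items.move_to_end(url)
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # drop the least recently used URL
        self.items[url] = None
        return False


class ClockCache:
    """Approximates LRU with one mark bit per slot and a rotating hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.marked = [False] * capacity
        self.index = {}              # url -> slot position
        self.hand = 0

    def access(self, url):
        pos = self.index.get(url)
        if pos is not None:
            self.marked[pos] = True  # a hit only sets the mark bit
            return True
        # Miss: sweep the hand, clearing marks, until an unmarked slot is found.
        while self.marked[self.hand]:
            self.marked[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]
        self.slots[self.hand] = url
        self.index[url] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False


def miss_rate(cache, stream):
    """Fraction of requests in the URL stream that miss the cache."""
    misses = sum(0 if cache.access(url) else 1 for url in stream)
    return misses / len(stream)
```

A hit in CLOCK touches a single bit rather than reordering a list, which is one reason it is attractive when many crawler threads share the cache.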
![Page 24: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/24.jpg)
Experimental setup
![Page 25: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/25.jpg)
URL Streams
- full trace
- cross sub-trace
![Page 26: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/26.jpg)
Result plots
![Page 27: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/27.jpg)
Result plots
![Page 28: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/28.jpg)
Result plots
![Page 29: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/29.jpg)
Results
- LRU and CLOCK performed equally well, but slightly worse than MIN, except in the critical region (for both traces)
- RANDOM is slightly inferior to CLOCK and LRU, while STATIC is generally much worse
- This indicates considerable locality of reference in the traces
- For very large caches, STATIC is better than MIN (excluding the initial k misses)
- STATIC is relatively better for the cross trace
  - Lack of deep links; links often point to home pages
  - The intersection between the most popular URLs and the cross trace tends to be larger
![Page 30: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/30.jpg)
Critical region
- Miss rate for all efficient algorithms is constant (~70%) for k = 2^14 to 2^18
- Above k = 2^18, the miss rate drops abruptly to ~20%
![Page 31: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/31.jpg)
Cache Implementation
![Page 32: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/32.jpg)
Conclusions and future directions
- 1,800 simulations over 26.86 billion URLs showed that a cache of 50,000 entries gives an 80% hit rate
- A cache size of 100 ~ 500 entries per thread is recommended
- A CLOCK or RANDOM implementation using a scatter table with circular chains is recommended
- To what extent does the graph traversal order affect caching?
- Is a global cache or a per-thread cache better?
![Page 33: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND](https://reader036.vdocument.in/reader036/viewer/2022070416/5697c0251a28abf838cd4e89/html5/thumbnails/33.jpg)
THANKS