a method for focused crawling using combination of link structure and content similarity seyedmohsen...

Post on 20-Dec-2015


A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

SeyedMohsen (Mohsen) Jamali

m_jamali@ce.sharif.edu

Introduction

The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines.

A focused crawler aims to selectively seek out pages that are relevant to a pre-defined set of topics.

A focused crawler entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate.

Crawler Architecture

Content Similarity Measure

Evaluations

1. Precision

2. Recall

We ran the algorithm twice: once starting from a good hub for the topic and once starting from a general page.

We compared both results with those of a standard BFS crawler.
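As a rough illustration of the two evaluation measures, the sketch below computes precision and recall for a set of crawled pages. It assumes the usual information-retrieval definitions (the slides do not spell them out) and a known reference set of relevant pages; the function name `precision_recall` and the example page identifiers are hypothetical.

```python
def precision_recall(crawled, relevant):
    """Return (precision, recall) for a crawl.

    Assumed definitions (not taken from the slides):
      precision = relevant pages crawled / all pages crawled
      recall    = relevant pages crawled / all relevant pages
    """
    crawled = set(crawled)
    relevant = set(relevant)
    hits = crawled & relevant  # relevant pages the crawler actually fetched

    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(crawled) if crawled else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four pages crawled, three pages known to be relevant.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

The same function could be applied to the output of the focused crawler and of the BFS crawler, so that the two runs are compared on equal terms.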

Experimental Results


TCP: Total Crawled Pages, RPC: Related Pages' Count, RCT: Relative Crawling Time, AHR: Average Harvest Rate

AHR: the mean of harvest rates in each segment
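The AHR definition can be sketched as follows. This is a minimal sketch under two assumptions that the slides do not confirm: a segment is a fixed-size run of consecutively crawled pages, and the harvest rate of a segment is the fraction of its pages judged related. The function name `average_harvest_rate` and the parameter `segment_size` are hypothetical.

```python
def average_harvest_rate(related_flags, segment_size):
    """Mean of per-segment harvest rates.

    related_flags: one boolean per crawled page, in crawl order
                   (True if the page was judged related to the topic).
    segment_size:  assumed number of pages per segment.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")

    rates = []
    for start in range(0, len(related_flags), segment_size):
        segment = related_flags[start:start + segment_size]
        # Harvest rate of this segment: related pages / pages in the segment.
        rates.append(sum(segment) / len(segment))

    # An empty crawl has no segments; report 0.0 rather than dividing by zero.
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical crawl of 8 pages split into two segments of 4:
# segment rates are 2/4 and 1/4, so the AHR is 0.375.
flags = [True, True, False, False, True, False, False, False]
ahr = average_harvest_rate(flags, segment_size=4)
```

Under these assumptions, TCP corresponds to `len(flags)` and RPC to `sum(flags)`.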


THE END
