![Page 1: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/1.jpg)
The Evolution of the Web and Implications for an Incremental Crawler
Junghoo Cho
Stanford University
![Page 2: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/2.jpg)
What is a Crawler?
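[Diagram: crawler loop with the steps "init", "get next url", "get page", "extract urls"; data stores "initial urls", "to visit urls", "visited urls", "web pages"; pages are fetched from the web.]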
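The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the crawler used in the talk: the names `crawl`, `extract_urls`, `get_page`, and `TOY_WEB` are hypothetical, the link extraction is a toy regular expression, and an in-memory dictionary stands in for the web so the sketch runs without network access.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def extract_urls(page: str) -> List[str]:
    """Toy link extractor (assumption): a regex, not a real HTML parser."""
    return re.findall(r'href="([^"]+)"', page)


def crawl(initial_urls: Iterable[str],
          get_page: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    to_visit = deque(initial_urls)    # "to visit urls", seeded by "initial urls"
    visited = set()                   # "visited urls"
    pages: Dict[str, str] = {}        # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()      # get next url (FIFO, so breadth first)
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)          # get page
        if page is None:              # unreachable or unknown URL
            continue
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical three-page "web" used only for illustration.
TOY_WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a>',
    "c": "",
}
collected = crawl(["a"], TOY_WEB.get)
```

Using a FIFO queue gives breadth-first order; swapping in a priority queue would be one way to change which URL is fetched next.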
![Page 3: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/3.jpg)
Crawling Issues (1)
• Load at visited web sites
• Load at crawlers
• Scope of the crawl
![Page 4: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/4.jpg)
Crawling Issues (2)
• Typical crawler
  • Periodic, Batch, Shadowing
• Incremental crawling
  • Maintain pages “fresh”
  • Avoid crawling from scratch
How do we crawl?
![Page 5: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/5.jpg)
Outline
• Web evolution experiments
• Freshness metrics
• Design issues and comparison
![Page 6: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/6.jpg)
Web Evolution Experiment
• How often does a web page change?
• What is the lifespan of a page?
• How long does it take for 50% of the web to change?
![Page 7: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/7.jpg)
Experimental Setup
• February 17 to June 24, 1999
• 270 sites visited (with permission)
  • identified 400 sites with highest “page rank”
  • contacted administrators
• 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new & old pages)
  • ran only 9pm - 6am, 10 seconds between site requests
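The politeness constraints in the setup (crawl only between 9pm and 6am, wait 10 seconds between requests to a site) could be checked with something like the sketch below. The function and constant names are hypothetical, and reading "10 seconds between site requests" as a per-site delay is an assumption.

```python
from datetime import datetime, time, timedelta
from typing import Optional

CRAWL_START = time(21, 0)            # 9pm, from the slide
CRAWL_END = time(6, 0)               # 6am, from the slide
SITE_DELAY = timedelta(seconds=10)   # assumed to apply per site


def in_crawl_window(now: time) -> bool:
    """The window wraps past midnight, so it is a union of two ranges."""
    return now >= CRAWL_START or now < CRAWL_END


def may_request(last_request: Optional[datetime], now: datetime) -> bool:
    """True if a request to a site is allowed at `now`."""
    if not in_crawl_window(now.time()):
        return False
    return last_request is None or now - last_request >= SITE_DELAY
```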
![Page 8: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/8.jpg)
How Often Does a Page Change?
Example: 50 daily visits to a page, 5 changes detected
average change interval = 50/5 = 10 days
Is this correct?
[Figure: timeline of page visits spaced 1 day apart, with the page's changes marked.]
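One reason the estimate may be off is that a daily visit can tell only whether the page changed since the previous visit, not how many times. The sketch below illustrates this under stated assumptions: visits happen at whole-day boundaries, and the change times in `change_times` are made up for the example. Function names are hypothetical.

```python
import math
from typing import Iterable


def detected_changes(change_times: Iterable[float], n_days: int) -> int:
    """Number of daily visits that find the page changed.

    Several changes within the same day collapse into one detection.
    """
    days_with_change = {math.floor(t) for t in change_times if 0 <= t < n_days}
    return len(days_with_change)


def naive_change_interval(n_days: int, n_detected: int) -> float:
    """The estimate from the slide: observation period / detected changes."""
    return n_days / n_detected


# Made-up change times (in days); the first two fall on the same day.
change_times = [0.2, 0.7, 3.5, 10.1, 20.4, 33.9]
n_detected = detected_changes(change_times, 50)
naive = naive_change_interval(50, n_detected)
actual = 50 / len(change_times)
```

Here six real changes yield only five detections, so the naive interval of 10 days is longer than the actual average of about 8.3 days.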
![Page 9: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/9.jpg)
Average Change Interval
[Figure: y-axis: fraction of pages.]
![Page 10: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/10.jpg)
Average Change Interval — By Domain
[Figure: y-axis: fraction of pages.]
![Page 11: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/11.jpg)
How Long Does a Page Live?
[Figure: four cases, each showing a page lifetime against the experiment duration.]
![Page 12: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/12.jpg)
Page Lifespans
[Figure: y-axis: fraction of pages.]
![Page 13: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/13.jpg)
Page Lifespans
Method 1 used
[Figure: y-axis: fraction of pages.]
![Page 14: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/14.jpg)
Time for a 50% Change
[Figure: fraction of unchanged pages versus days.]
![Page 15: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/15.jpg)
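Change Metrics
Freshness [SIGMOD 2000]
Freshness of element $e_i$ at time $t$ is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
[Diagram: elements $e_i$ on the web and their copies $e_i$ in the database.]
Freshness of the database $S$ at time $t$ is
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$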
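As a rough illustration of the freshness definition, the sketch below computes it for a small made-up collection. Representing each element by a version string and comparing the local copy with the live one is an assumption made for the example, and all names are hypothetical.

```python
from typing import Dict


def element_freshness(local_version: str, live_version: str) -> int:
    """F(e_i; t): 1 if the local copy matches the live page, else 0."""
    return 1 if local_version == live_version else 0


def database_freshness(local: Dict[str, str], live: Dict[str, str]) -> float:
    """F(S; t): mean of the element freshness over all N local elements."""
    return sum(element_freshness(local[k], live[k]) for k in local) / len(local)


# Made-up snapshot at some time t: p2 and p4 have changed on the web.
local_copy = {"p1": "v1", "p2": "v1", "p3": "v2", "p4": "v1"}
live_web = {"p1": "v1", "p2": "v2", "p3": "v2", "p4": "v3"}
freshness = database_freshness(local_copy, live_web)  # 2 of 4 up to date
```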
![Page 16: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/16.jpg)
Change Metrics
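Age [SIGMOD 2000]
Age of element $e_i$ at time $t$ is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
[Diagram: elements $e_i$ on the web and their copies $e_i$ in the database.]
Age of the database $S$ at time $t$ is
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$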
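The age definition can be illustrated the same way. In this sketch an element is described only by its modification time, with `None` standing for "the local copy is up to date"; that encoding, the sample values, and the function names are assumptions for the example.

```python
from typing import List, Optional


def element_age(t: float, modification_time: Optional[float]) -> float:
    """A(e_i; t): 0 if up to date, else time elapsed since the modification."""
    return 0.0 if modification_time is None else t - modification_time


def database_age(t: float, modification_times: List[Optional[float]]) -> float:
    """A(S; t): mean of the element ages over all N elements."""
    return sum(element_age(t, m) for m in modification_times) / len(modification_times)


# Made-up state at t = 10 days: two elements are stale, two are up to date.
age = database_age(10.0, [None, 4.0, 7.0, None])  # (0 + 6 + 3 + 0) / 4
```

Unlike freshness, age keeps growing for as long as a stale copy is not refreshed, so it penalizes copies that have been out of date for a long time.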
![Page 17: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/17.jpg)
Crawler Types
• In-place vs. shadow
• Steady vs. batch
[Diagram: elements $e_i$ on the web, the database, and a shadow database.]
[Diagram: timeline marking when the crawler is on and when it is off.]
![Page 18: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/18.jpg)
Comparison: Batch vs. Steady
[Figure: two panels, "batch-mode in-place crawler" and "steady in-place crawler", with the periods when the crawler is running marked.]
![Page 19: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/19.jpg)
Shadowing Steady Crawler
[Figure: panels labeled "crawler's collection" and "current collection"; annotation: "without shadowing".]
![Page 20: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/20.jpg)
Shadowing Batch Crawler
[Figure: panels labeled "crawler's collection" and "current collection"; annotation: "without shadowing".]
![Page 21: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/21.jpg)
Experimental Data: Freshness
| | Steady | Batch |
|---|---|---|
| In-Place | 0.88 | 0.88 |
| Shadowing | 0.77 | 0.86 |

• Pages change on average every 4 months
• Batch crawler works one week out of 4
[Slide annotations: 1, 2, 0.63, 0.50]
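The in-place figure of 0.88 can be roughly reproduced under two assumptions that the slide does not state: pages change as a Poisson process with rate λ, and each page is refreshed in place once every interval I. Under that model, a standard result is that the time-averaged freshness of a page is (1 − e^(−λI)) / (λI). The sketch below evaluates it for a 4-month change interval and a refresh cycle of about one month (approximating the 4-week cycle); the function name is hypothetical.

```python
import math


def expected_freshness(change_interval: float, sync_interval: float) -> float:
    """Time-averaged freshness of one page under the assumed Poisson model.

    change_interval: mean time between changes (1 / lambda)
    sync_interval:   time between in-place refreshes of the page (I)
    Both arguments must use the same time unit.
    """
    r = sync_interval / change_interval   # lambda * I
    return (1.0 - math.exp(-r)) / r


# Pages change every 4 months on average; each page is refreshed monthly.
f_in_place = expected_freshness(change_interval=4.0, sync_interval=1.0)
print(round(f_in_place, 2))  # prints 0.88
```

This covers only the in-place case. The shadowing entries depend on when the shadow collection replaces the current one, which this sketch does not model.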
![Page 22: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/22.jpg)
Uniform vs. Variable
| | Freshness | Age |
|---|---|---|
| Uniform | 0.57 | 5.6 days |
| Variable | 0.62 | 4.3 days |

• In-place, steady crawler
• Based on our experimental data
[Pages change at different frequencies, as measured in experiment.]
[SIGMOD 2000]
![Page 23: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/23.jpg)
Summary
• Steady
• In-place
• Variable visit frequencies
This combination improves freshness!
Improvement depends on how the web changes
![Page 24: The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University](https://reader036.vdocument.in/reader036/viewer/2022062715/56649d795503460f94a5d015/html5/thumbnails/24.jpg)
The End
• The paper proposes an architecture
• Thank you for your attention