![Page 1: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/1.jpg)
Measuring the Size of the Web
Dongwon Lee, Ph.D.
IST 501, Fall 2014
Penn State
![Page 2: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/2.jpg)
Studying the Web
To study the characteristics of the Web: statistics, topology, behavior, …
Why? Scientific curiosity; practical value
Eg, search engine coverage
Nature 1999
![Page 3: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/3.jpg)
Web as Platform
The Web becomes a new computation platform
Poses new challenges: scale, efficiency, heterogeneity, impact on people’s lives
![Page 4: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/4.jpg)
Eg, How Big is the Web?
Q1: How many web sites?
Q2: How many web pages?
Q3: How many surface/deep web pages?
Research method: mostly the experimental method, used to validate novel solutions
![Page 5: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/5.jpg)
Q1: How Many Web Sites?
DNS registrars: list of domain names
Issues:
Not every domain is a web site
A domain may contain more than one web site
Registrars are under no obligation to keep their records correct
So many of them …
![Page 6: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/6.jpg)
How Many Web Sites?
Brute-force: polling every IP
IPv4: 256 × 256 × 256 × 256 = 2^32 ≈ 4 billion addresses
IPv6: 2^128 addresses
10 sec/IP, 1000 simultaneous connections: 2^32 × 10 / (1000 × 24 × 60 × 60) ≈ 497 days (≈ 460 days if 2^32 is rounded to 4 billion)
Not going to work!!
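The polling-time arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `brute_force_days` is illustrative, and the 10 seconds per IP and 1000 simultaneous connections are the figures from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def brute_force_days(num_addresses, secs_per_ip=10, connections=1000):
    """Days needed to poll every address once, given a fixed per-address
    cost and a fixed number of simultaneous connections."""
    return num_addresses * secs_per_ip / (connections * SECONDS_PER_DAY)

exact = brute_force_days(2**32)             # full IPv4 address space
rounded = brute_force_days(4_000_000_000)   # "4 billion" approximation
print(f"exact 2^32: {exact:.0f} days, rounded to 4e9: {rounded:.0f} days")
```

Either way the answer is well over a year for a single pass, which is why the slides move on to sampling.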
![Page 7: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/7.jpg)
How Many Web Sites? 2nd attempt: Sampling
T: All 4 Billion IPs
S: Sampled IPs
V: Valid reply
Estimated # of web sites ≈ |T| × |V| / |S|
![Page 8: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/8.jpg)
How Many Web Sites?
Estimated # of web sites ≈ |T| × |V| / |S|
1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies: “HTTP 200 OK” = |V|
4. |T| = 2^32
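The sampling procedure can be tried offline on a toy address space. The sketch below is an illustration under stated assumptions, not the survey code: the address space has 1,000,000 entries instead of 2^32, the set `servers` is made-up ground truth, and `is_valid` stands in for "an HTTP request to port 80 returned 200 OK".

```python
import random

def estimate_total(total_size, sample_size, is_valid, rng):
    """Probe a uniform random sample of addresses and scale the hit rate
    up to the whole space: |T| * |V| / |S|."""
    sample = rng.sample(range(total_size), sample_size)
    valid = sum(1 for addr in sample if is_valid(addr))
    return total_size * valid / sample_size

# Toy address space standing in for the 2^32 IPv4 addresses (assumption).
TOTAL = 1_000_000
rng = random.Random(42)
servers = set(rng.sample(range(TOTAL), 20_000))  # hidden ground truth: 2%

# Membership in `servers` plays the role of a valid HTTP reply.
estimate = estimate_total(TOTAL, 50_000, servers.__contains__, rng)
print(f"true = {len(servers)}, estimated = {estimate:.0f}")
```

The estimate is only as good as the validity check, which is where the issues on the next slide come in.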
Q: What are the issues here?
![Page 9: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/9.jpg)
Issues
Virtual hosting
Ports other than 80
Temporarily unavailable sites
…
![Page 10: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/10.jpg)
OCLC Survey (2002)
OCLC (Online Computer Library Center)
Results: http://wcp.oclc.org/
Still room for growth (at least for Web sites)??
![Page 11: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/11.jpg)
NetCraft Web Server Survey (2010)
Goal is to measure web server market share
Also records # of sites their crawlers visited
August 2010: 213,458,815 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
![Page 12: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/12.jpg)
NetCraft Web Server Survey (2013)
Goal is to measure web server market share
Also records # of sites their crawlers visited
August 2013: 716,822,317 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
![Page 13: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/13.jpg)
NetCraft Web Server Survey (2014)
Goal is to measure web server market share
Also records # of sites their crawlers visited
August 2014: 992,177,228 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
![Page 14: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/14.jpg)
Q2: How Many Web Pages? Sampling based?
Issue here?
T: All URLs
S: Sampled URLs
V: Valid reply
Estimated # of web pages ≈ |T| × |V| / |S|
![Page 15: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/15.jpg)
How Many Web Pages?
Method #1:
For each site with a valid reply, download all pages
Measure the average # of pages per site
(Avg # of pages per site) × (total # of sites)
Result [Lawrence & Giles, 1999]:
289 pages per site, 2.8M sites
289 × 2.8M ≈ 800M web pages
![Page 16: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/16.jpg)
Further Issues
A small # of sites with TONS of pages: sampling could miss these sites
Majority of sites with a small # of pages: lots of samples necessary
[Figure: no. of pages (0 to 1,000,000) plotted against no. of sites (0 to 1000); label: 99.99% of the sites]
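Why a skewed distribution hurts Method #1 can be seen on synthetic data. The sketch below is illustrative only: the population is an assumption (99.99% of 100,000 sites have 1 to 100 pages, and 10 sites have 1,000,000 pages each), chosen to mimic the shape of the figure rather than any measured data.

```python
import random

rng = random.Random(0)

# Synthetic population (assumption): almost all sites are small,
# and a handful are enormous.
small_sites = [rng.randint(1, 100) for _ in range(99_990)]
giant_sites = [1_000_000] * 10
pages_per_site = small_sites + giant_sites
true_total = sum(pages_per_site)

def estimate_total_pages(sample_size):
    """Method #1: average pages per sampled site times total # of sites."""
    sample = rng.sample(pages_per_site, sample_size)
    return sum(sample) / sample_size * len(pages_per_site)

estimates = [estimate_total_pages(1_000) for _ in range(200)]
underestimates = sum(1 for e in estimates if e < true_total / 2)

print(f"true total pages: {true_total:,}")
print(f"min / max estimate: {min(estimates):,.0f} / {max(estimates):,.0f}")
print(f"{underestimates} of {len(estimates)} runs fall below half the true total")
```

Most runs miss every giant site and underestimate badly, while the few runs that catch one overshoot, so many more samples are needed before the average settles.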
![Page 17: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/17.jpg)
How Many Web Pages?
Method #2: Random sampling
Assume:
T: All pages
B: Base set
S: Random samples
![Page 18: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/18.jpg)
Random Page?
Idea: Random walk
Start from a portal home page (eg, Yahoo)
Estimate the size of the portal: B
Follow random links, say 10,000 times
Select the pages
At the end, a set of random web pages S is gathered
![Page 19: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/19.jpg)
Straightforward Random Walk
[Figure: Web graph containing google.com, amazon.com, and pike.psu.edu, with the steps of a walk numbered 1 to 9]
Follow a random out-link at each step
Issues?
![Page 20: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/20.jpg)
Straightforward Random Walk
Follow a random out-link at each step
Issues:
1. Gets stuck in sinks and in dense Web communities
2. Biased towards popular pages
3. Converges slowly, if at all
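The first two issues show up even on a five-page toy graph. The following is a minimal sketch, not the experiment from the lecture: the page names and link structure in `GRAPH` are hypothetical, and the walk simply follows a uniformly random out-link until it hits a page with none.

```python
import random
from collections import Counter

# Tiny directed "Web" (hypothetical pages): out-links per page.
GRAPH = {
    "portal": ["a", "b", "c"],
    "a": ["portal"],
    "b": ["portal", "dead_end"],
    "c": ["portal"],
    "dead_end": [],  # a sink: no out-links
}

def random_walk(graph, start, max_steps, rng):
    """Follow a uniformly random out-link at each step.
    Returns the list of visited pages; stops early at a sink."""
    page, visited = start, [start]
    for _ in range(max_steps):
        links = graph[page]
        if not links:
            break  # stuck: nowhere to go
        page = rng.choice(links)
        visited.append(page)
    return visited

rng = random.Random(1)
visits, stuck = Counter(), 0
for _ in range(1_000):
    path = random_walk(GRAPH, "portal", 20, rng)
    visits.update(path)
    stuck += path[-1] == "dead_end"

print("walks ending in the sink:", stuck)
print("visit counts:", visits.most_common())
```

Most walks end in the sink well before 20 steps, and the heavily linked portal page dominates the visit counts, so the visited pages are far from a uniform sample.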
![Page 21: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/21.jpg)
Going to Converge?
Random walks on a regular, undirected graph → uniformly distributed sample
Theorem [Markov chain folklore]: After O((1/ε) log N) steps, a random walk reaches the stationary distribution
ε: depends on the graph structure
N: number of nodes
Idea: Transform the Web graph to a regular, undirected graph, then perform a random walk
Problem: The Web is neither regular nor undirected
![Page 22: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/22.jpg)
Intuition
Random walk on the undirected Web graph (not regular)
High chance of being at a “popular” node at a particular time
Increase the chance of being at an “unpopular” node by staying there longer through a self loop
[Figure: unpopular nodes and a popular node]
![Page 23: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/23.jpg)
WebWalker: Undirected Regular Random Walk on the Web
Fact: A random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.
[Figure: Web graph with google.com, amazon.com, and pike.psu.edu, annotated with self-loop weights w(v) = deg_max - deg(v)]
Follow a random out-link or a random in-link at each step
Use weighted self loops to even out pages’ degrees
![Page 24: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/24.jpg)
Ideal Random Walk
Generate the regular, undirected graph:
Make edges undirected
Decide d, the maximum # of edges per page: say, 300,000
If edge(n) < 300,000, then add self-loops
Perform random walks on the graph
ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
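The effect of the self loops can be verified on a small graph. This is a sketch of the idea only, not the WebWalker implementation: the edge list is hypothetical, and instead of simulating a walk it pushes a probability distribution through the transition rule many times, which gives the long-run visit probabilities directly.

```python
# Small undirected graph (hypothetical pages): links treated as two-way.
EDGES = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

def neighbors(edges):
    """Build an adjacency list from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def step(dist, adj, d_max=None):
    """One step of the walk applied to a probability distribution.
    With d_max set, node v gets d_max - deg(v) self loops, so it moves to
    each neighbor with probability 1/d_max and stays put otherwise."""
    new = dict.fromkeys(adj, 0.0)
    for v, p in dist.items():
        degree = d_max if d_max is not None else len(adj[v])
        for u in adj[v]:
            new[u] += p / degree
        new[v] += p * (degree - len(adj[v])) / degree  # self-loop mass
    return new

def long_run(adj, d_max=None, steps=2_000):
    """Distribution of the walk after many steps, starting at the hub."""
    dist = dict.fromkeys(adj, 0.0)
    dist["hub"] = 1.0
    for _ in range(steps):
        dist = step(dist, adj, d_max)
    return dist

adj = neighbors(EDGES)
d_max = max(len(ns) for ns in adj.values())
plain = long_run(adj)            # no self loops: biased towards high degree
regular = long_run(adj, d_max)   # with self loops: uniform

print("plain:  ", {v: round(p, 3) for v, p in plain.items()})
print("regular:", {v: round(p, 3) for v, p in regular.items()})
```

Without self loops the walk spends time at each page in proportion to its degree; with them every page ends up equally likely, at the cost of many steps wasted standing still.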
![Page 25: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/25.jpg)
WebWalker Results (2000)
Size of the Web (pages):
Altavista: |B| = 250M
|B ∩ S| / |S| = 35%
Estimated |T| ≈ 720M
Avg page size: 12K
Avg # of out-links: 10
Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000.
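The size estimate follows from assuming that a uniform random sample S overlaps the base set B in proportion to B's share of all pages, so |T| ≈ |B| / (|B ∩ S| / |S|). A minimal sketch with the slide's numbers; the function name is illustrative:

```python
def estimate_web_size(base_size, overlap_fraction):
    """|T| ≈ |B| / (|B ∩ S| / |S|), assuming S is a uniform sample of T."""
    return base_size / overlap_fraction

estimate = estimate_web_size(250e6, 0.35)
print(f"Estimated |T| ≈ {estimate / 1e6:.0f}M pages")
```

With the rounded 35% overlap this gives roughly 714M pages, in line with the reported figure of about 720M.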
![Page 26: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/26.jpg)
How large is SE’s Index?
Prepare a representative corpus (eg, DMOZ)
Draw a word W with known frequency percentage F
Eg, “The” is present in 60% of all documents within the corpus
Submit W to a search engine E
If E reports that X documents contain W, one can extrapolate the total size of E’s index as ≈ X / F
Repeat multiple times and compute the average
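The extrapolation can be sketched as follows. The probe values are hypothetical numbers made up for illustration, not measurements from any search engine, and `estimate_index_size` is an illustrative name.

```python
from statistics import mean

def estimate_index_size(observations):
    """Average X / F over several probe words.
    observations: iterable of (hits X reported by the engine,
    fraction F of corpus documents containing the word)."""
    return mean(hits / freq for hits, freq in observations)

# Hypothetical probe words: (reported hits, corpus frequency).
probes = [
    (6_000_000_000, 0.60),
    (2_450_000_000, 0.25),
    (990_000_000, 0.10),
]
index_size = estimate_index_size(probes)
print(f"estimated index size: {index_size:,.0f} documents")
```

Averaging over several words smooths out the error from any single word whose frequency in the corpus differs from its frequency in the engine's index.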
![Page 27: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/27.jpg)
http://www.worldwidewebsize.com/ (2010)
28 billion
![Page 28: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/28.jpg)
http://www.worldwidewebsize.com/ (2011)
46 billion
![Page 29: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/29.jpg)
http://www.worldwidewebsize.com/ (2013)
46 billion
![Page 30: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/30.jpg)
http://www.worldwidewebsize.com/ (2013)
10 billion
![Page 31: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/31.jpg)
Google Reveals Itself (2008)
1998: 26 million URLs
2000: 1 billion URLs
2008: 1 trillion URLs
Not all of them are indexed: duplicates, auto-generated pages (eg, calendars), spam
Experts suspect (2010): Google indexes at least 40 billion
![Page 32: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/32.jpg)
Deep Web (aka Hidden Web)
[Figure: an HTML FORM interface takes a query and returns answers]
![Page 33: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/33.jpg)
Q3: Size of Deep Web?
Deep Web: Information reachable only through a query interface (eg, HTML FORM)
Often backed by a DBMS
Estimation: (Avg size of record) × (Avg # of records per site) × (Total # of Deep Web sites)
How to estimate? By sampling
![Page 34: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/34.jpg)
Size of Deep Web?
Total # of Deep Web sites: |B ∩ S| / |S|
Avg size of a record: issue random queries; estimate reply size
Avg # of records per site: permute all possible queries for the FORM; issue all queries and count valid returns
![Page 35: Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State](https://reader037.vdocument.in/reader037/viewer/2022110211/56649eef5503460f94bffcaa/html5/thumbnails/35.jpg)
Size of Deep Web (2005)
BrightPlanet report estimates:
Avg size of a record: 14KB
Avg # of records per site: 5M
Total # of Deep Web sites: 200,000
Size of the Deep Web: ~10^16 bytes (10 petabytes)
1,000 times larger than the “Surface Web”
How to access it? Wrapper/Mediator (aka Web scraping)
http://brightplanet.com/the-deep-web/deep-web-faqs/ (obsolete now)
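The three-factor estimate multiplies out as below. This is a back-of-the-envelope sketch: it takes the average number of records per site to be 5 million (an assumption about the slide's unit) and uses decimal kilobytes and petabytes.

```python
avg_record_bytes = 14 * 10**3       # 14 KB per record
avg_records_per_site = 5 * 10**6    # assumed: 5 million records per site
deep_web_sites = 200_000

total_bytes = avg_record_bytes * avg_records_per_site * deep_web_sites
print(f"{total_bytes:.1e} bytes = {total_bytes / 10**15:.0f} PB")
```

The product is 1.4 × 10^16 bytes, the same order of magnitude as the 10^16 quoted on the slide.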