Measuring the Size of the Web
Dongwon Lee, Ph.D.
IST 501, Fall 2014
Penn State
Studying the Web
To study the characteristics of the Web: statistics, topology, behavior, …
Why? Scientific curiosity, and practical value (e.g., search engine coverage)
Nature 1999
Web as Platform
The Web has become a new computation platform, which poses new challenges:
Scale, efficiency, heterogeneity, impact on people's lives
E.g., How Big is the Web?
Q1: How many web sites?
Q2: How many web pages?
Q3: How many surface/deep web pages?
Research method: mostly the experimental method, used to validate novel solutions
Q1: How Many Web Sites?
DNS registrars: lists of domain names
Issues:
Not every domain is a web site
A domain can contain more than one web site
Registrars are under no obligation to keep their records correct
There are many registrars …
How Many Web Sites?
Brute force: poll every IP address
IPv4: 256 × 256 × 256 × 256 = 2^32 ≈ 4 billion addresses (IPv6: 2^128)
At 10 sec/IP with 1,000 simultaneous connections: 2^32 × 10 / (1000 × 24 × 60 × 60) ≈ 500 days
Not going to work!
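The arithmetic behind that estimate can be checked directly:

```python
# Back-of-envelope: time to probe every IPv4 address,
# assuming a 10-second timeout per IP and 1,000 simultaneous connections.
TOTAL_IPS = 2**32
SECONDS_PER_PROBE = 10
CONCURRENCY = 1000

seconds = TOTAL_IPS * SECONDS_PER_PROBE / CONCURRENCY
days = seconds / (24 * 60 * 60)
print(f"{days:.0f} days")  # ~497 days
```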
How Many Web Sites? 2nd attempt: Sampling
T: all 2^32 IPv4 addresses
S: sampled IPs
V: sampled IPs that give a valid reply
Estimate: # of web sites ≈ (|V| / |S|) × |T|
How Many Web Sites?
1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies ("HTTP 200 OK"): |V|
4. Estimate: (|V| / |S|) × |T|, with |T| = 2^32
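A minimal simulation of these four steps; `has_web_server` is a made-up stand-in for a real HTTP probe, with an assumed density of one web server per 2,000 addresses:

```python
import random

def has_web_server(ip: int) -> bool:
    """Stand-in for an HTTP request to port 80 at this IP.

    In this toy model, one in every 2,000 addresses hosts a web
    server (a made-up density, not a measured one)."""
    return ip % 2000 == 0

T = 2**32                                               # |T|: all IPv4 addresses
random.seed(42)
sample = [random.randrange(T) for _ in range(200_000)]  # |S| random IPs
valid = sum(has_web_server(ip) for ip in sample)        # |V| valid replies

estimate = valid / len(sample) * T
print(f"|V|/|S| = {valid}/{len(sample)}, estimated sites ≈ {estimate:,.0f}")
```

With the assumed density, the true count is about 2^32 / 2000 ≈ 2.1M sites, and the sample estimate lands near it.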
Q: What are the issues here?
Issues
Virtual hosting (many sites per IP)
Web servers on ports other than 80
Temporarily unavailable sites
…
OCLC Survey (2002)
OCLC (Online Computer Library Center)
Results: http://wcp.oclc.org/
Still room for growth (at least for Web sites)?
NetCraft Web Server Survey (2010)
Goal: measure web server market share
Also records # of sites their crawlers visited
August 2010: 213,458,815 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
NetCraft Web Server Survey (2013)
Goal: measure web server market share
Also records # of sites their crawlers visited
August 2013: 716,822,317 distinct sites
NetCraft Web Server Survey (2014)
Goal: measure web server market share
Also records # of sites their crawlers visited
August 2014: 992,177,228 distinct sites
Q2: How Many Web Pages? Sampling based?
T: all URLs
S: sampled URLs
V: valid replies
Issue here? Unlike IPv4 addresses, the space of all URLs cannot be enumerated, so there is no direct way to draw a uniform sample.
How Many Web Pages?
Method #1: For each site with a valid reply, download all its pages
Measure the average # of pages per site
Estimate: (avg # of pages per site) × (total # of sites)
Result [Lawrence & Giles, 1999]: 289 pages/site × 2.8M sites ≈ 800M web pages
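The extrapolation is a one-line product, using the Lawrence & Giles (1999) figures:

```python
# Method #1 extrapolation from the 1999 survey figures.
avg_pages_per_site = 289
total_sites = 2_800_000

total_pages = avg_pages_per_site * total_sites
print(f"{total_pages:,}")  # 809,200,000, i.e. ~800M pages
```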
Further Issues
A small # of sites have TONS of pages; sampling could miss these sites
The majority of sites have a small # of pages, so many samples are necessary
[Figure: # of pages vs. # of sites — a heavy-tailed distribution; 99.99% of the sites have few pages, while a handful have up to ~1,000,000 pages]
How Many Web Pages?
Method #2: Random sampling
Assume:
T: all pages
B: base set (e.g., a search engine's index, with known size |B|)
S: random sample of pages
Estimate: |T| ≈ |B| × |S| / |B∩S|
Random Page?
Idea: random walk
Start from a portal home page (e.g., Yahoo); estimate the size of the portal: B
Follow random links, say 10,000 times, selecting pages along the way
At the end, a set S of (approximately) random web pages has been gathered
Straightforward Random Walk
[Figure: a walk over pages of google.com, amazon.com, pike.psu.edu, steps numbered 1–9]
Follow a random out-link at each step
Issues?
Straightforward Random Walk
Follow a random out-link at each step
Issues:
1. Gets stuck in sinks and in dense Web communities
2. Biased towards popular pages
3. Converges slowly, if at all
Going to Converge?
Random walks on a regular, undirected graph give a uniformly distributed sample
Theorem [Markov chain folklore]: after O(log N) steps, a random walk reaches the stationary distribution (the hidden constant depends on the graph structure; N = number of nodes)
Idea: transform the Web graph into a regular, undirected graph, then perform a random walk
Problem: the Web is neither regular nor undirected
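The folklore theorem can be observed numerically. A minimal sketch (the cube graph is an arbitrary small 3-regular example; the walk is made "lazy" with a probability-1/2 self-loop, since the cube is bipartite and the plain walk would oscillate rather than converge):

```python
# Toy demonstration (not WebWalker itself): on a connected, regular,
# undirected graph, a lazy random walk converges to the uniform distribution.
# Graph: the 3-regular cube graph on 8 nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (0, 4), (1, 5), (2, 6), (3, 7)]
n = 8
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

# Start with all probability mass on node 0 and iterate the lazy walk:
# stay put with probability 1/2, otherwise move to a uniform neighbor.
dist = [1.0] + [0.0] * (n - 1)
for _ in range(200):
    nxt = [0.5 * p for p in dist]
    for u in range(n):
        for v in adj[u]:
            nxt[v] += 0.5 * dist[u] / len(adj[u])
    dist = nxt

print([round(p, 4) for p in dist])  # each entry ≈ 1/8 = 0.125
```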
Intuition
A random walk on the undirected Web graph (which is not regular) has a high chance of being at a "popular" node at any particular time
Increase the chance of being at an "unpopular" node by staying there longer, through self-loops
[Figure: a popular node surrounded by unpopular nodes]
WebWalker: Undirected Regular Random Walk on the Web
Fact: a random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps
Make the graph regular with weighted self-loops: w(v) = deg_max − deg(v)
[Figure: a walk over google.com, amazon.com, pike.psu.edu, with per-node degrees and self-loop weights]
Follow a random out-link or a random in-link at each step
Use weighted self-loops to even out pages' degrees
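A sketch of the self-loop idea on a toy link graph (the domains and the link structure, including `example.org`, are illustrative stand-ins, not measured data):

```python
import random

# Toy directed link graph; names and links are made up for illustration.
links = {
    "google.com":   ["amazon.com", "pike.psu.edu", "example.org"],
    "amazon.com":   ["google.com"],
    "pike.psu.edu": ["google.com"],
    "example.org":  [],
}

# Treat edges as undirected: neighbors = out-links ∪ in-links.
nodes = list(links)
nbrs = {v: set(links[v]) for v in nodes}
for u, outs in links.items():
    for v in outs:
        nbrs[v].add(u)

deg_max = max(len(nbrs[v]) for v in nodes)

def step(v: str) -> str:
    """One WebWalker-style step: with weight w(v) = deg_max - deg(v),
    take the self-loop and stay at v; otherwise move to a uniform neighbor."""
    deg = len(nbrs[v])
    if random.randrange(deg_max) < deg_max - deg:
        return v                          # self-loop: stay put
    return random.choice(sorted(nbrs[v]))

random.seed(0)
visits = {v: 0 for v in nodes}
v = "google.com"
for _ in range(100_000):
    v = step(v)
    visits[v] += 1

print({k: round(c / 100_000, 3) for k, c in visits.items()})  # each ≈ 0.25
```

The self-loops give every node the same total weight deg_max, so the visit frequencies even out to the uniform distribution despite the very uneven degrees.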
Ideal Random Walk
Generate the regular, undirected graph:
Make edges undirected
Choose deg_max, the maximum # of edges per page: say, 300,000
If deg(n) < 300,000, add self-loops to make up the difference
Perform random walks on the graph (for the 1996 Web: N ≈ 10^9, graph-dependent constant ≈ 10^-5)
WebWalker Results (2000)
Size of the Web: AltaVista as base set, |B| = 250M; |B∩S|/|S| = 35%; estimated |T| ≈ 720M pages
Avg page size: 12KB; avg # of out-links: 10
Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000.
How Large is a Search Engine's Index?
Prepare a representative corpus (e.g., DMOZ)
Draw a word W with known frequency F in the corpus
E.g., "the" is present in 60% of all documents within the corpus
Submit W to a search engine E
If E reports X documents containing W, extrapolate the total size of E's index as ≈ X / F
Repeat multiple times and average
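A sketch of this extrapolation; the five-document corpus and the engine's per-word hit counts are made-up stand-ins:

```python
# Word-frequency extrapolation of a search engine's index size.
corpus = [
    "the quick brown fox", "a lazy dog", "the cat sat",
    "hello world", "the end",
]

def corpus_frequency(word: str) -> float:
    """F: fraction of corpus documents containing `word`."""
    return sum(word in doc.split() for doc in corpus) / len(corpus)

# Pretend the engine reported these document counts (engine_hits[w] = X).
engine_hits = {"the": 24_000_000_000, "cat": 3_900_000_000}

estimates = []
for w, x in engine_hits.items():
    f = corpus_frequency(w)
    estimates.append(x / f)      # index size ≈ X / F

print(f"estimated index size ≈ {sum(estimates) / len(estimates):,.0f}")
```

Averaging over many words smooths out the noise from any single word's frequency being unrepresentative.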
http://www.worldwidewebsize.com/ (2010): ~28 billion pages
http://www.worldwidewebsize.com/ (2011): ~46 billion pages
http://www.worldwidewebsize.com/ (2013): ~46 billion pages
http://www.worldwidewebsize.com/ (2013): ~10 billion pages
Google Reveals Itself (2008)
1998: 26 million URLs; 2000: 1 billion URLs; 2008: 1 trillion URLs
Not all of them are indexed: duplicates, auto-generated pages (e.g., calendars), spam
Experts suspected (2010) that Google indexes at least 40 billion pages
Deep Web (aka Hidden Web)
[Figure: a query sent through an HTML FORM interface; answers returned]
Q3: Size of the Deep Web?
Deep Web: information reachable only through a query interface (e.g., an HTML FORM), often backed by a DBMS
How to estimate? By sampling:
(avg size of a record) × (avg # of records per site) × (total # of Deep Web sites)
Size of the Deep Web?
Total # of Deep Web sites: estimated by sampling, as before (|B∩S|/|S|)
Avg size of a record: issue random queries and measure reply sizes
Avg # of records per site: permute all possible queries for the FORM, issue them, and count valid returns
Size of the Deep Web (2005)
BrightPlanet report estimates:
Avg size of a record: 14KB
Avg # of records per site: ~5M
Total # of Deep Web sites: 200,000
Size of the Deep Web: ~10^16 bytes (10 petabytes), 1,000 times larger than the "Surface Web"
How to access it? Wrapper/mediator (aka web scraping)
http://brightplanet.com/the-deep-web/deep-web-faqs/ : obsolete now
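Reading the "avg # of records per site" figure as 5 million records (an interpretation, since the slide's unit is ambiguous), the three factors multiply out to the quoted ~10 petabytes:

```python
# Consistency check of the BrightPlanet figures.
record_bytes = 14 * 1024        # 14KB per record
records_per_site = 5_000_000    # assumed reading: 5M records per site
num_sites = 200_000             # total # of Deep Web sites

total = record_bytes * records_per_site * num_sites
print(f"{total:.1e} bytes")     # ~1.4e16 bytes, on the order of 10 petabytes
```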