Web Data Mining, Lecture 2: The Web Graph (academic year 2006/2007)



The Web Graph

The linkage structure of Web pages forms a graph.

The Web Graph (hereinafter called W) is a directed graph W = (V,E).

V is the vertex set; each vertex represents a page in the Web.

E is the edge set; a directed edge (u,v) exists whenever the page represented by u contains a link to the page represented by v.


A Toy Example of W

[Figure: four pages, numbered 1 to 4, whose hyperlinks (Link11, Link12, Link21, Link22, Link31, Link41) are drawn as directed edges, shown together with matrix and list encodings of the links.]

V = {1, 2, 3, 4}

E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}

Adjacency lists (out-links):

  1: 2, 4
  2: 3, 4
  3: 1
  4: 3
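As a small illustration of the definitions, the toy graph can be stored either as adjacency lists or as an adjacency matrix. The sketch below is only one possible encoding; the helper names `adjacency_lists` and `adjacency_matrix` are hypothetical and not part of the lecture.

```python
# Toy web graph from the slide: pages are vertices, hyperlinks are directed edges.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]


def adjacency_lists(vertices, edges):
    """Map each page to the sorted list of pages it links to (hypothetical helper)."""
    out = {v: [] for v in vertices}
    for u, v in edges:
        out[u].append(v)
    return {v: sorted(targets) for v, targets in out.items()}


def adjacency_matrix(vertices, edges):
    """Return a 0/1 matrix A with A[i][j] = 1 iff vertices[i] links to vertices[j]."""
    index = {v: i for i, v in enumerate(vertices)}
    A = [[0] * len(vertices) for _ in vertices]
    for u, v in edges:
        A[index[u]][index[v]] = 1
    return A


out_links = adjacency_lists(V, E)
A = adjacency_matrix(V, E)
```

Adjacency lists take space proportional to |V| + |E|, while the matrix takes |V|^2 entries, which is why list-style encodings are the usual choice for a graph as sparse as W.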


A More Realistic W


The size of W

What is being measured?

  Number of hosts
  Number of (static) HTML pages
  Volume of data

Number of hosts: Netcraft survey, http://news.netcraft.com/archives/web_server_survey.html

  A monthly report on how many web hosts and servers are out there.

Number of pages: numerous estimates.

  Recently Yahoo announced an index with 20B pages.


The “real” size of W

The web is really infinite.

  Dynamic content, e.g. calendars, online organizers, etc.

  http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html

The static web contains syntactic duplication, mostly due to mirroring (~20-30%).

Some servers are seldom connected.


Recent Measurement of W

[Gulli & Signorini, 2005]. Total (indexable) web > 11.5B pages.

2.3B pages are unknown to popular search engines.

35-120B pages lie within the hidden web.

The index intersection between the largest available search engines (namely Google, Yahoo!, MSN, AskJeeves) is estimated to be 28.8%.


Evolution of W

All of these numbers keep changing.

There are relatively few scientific studies of the evolution of the web [Fetterly & al., 2003]. http://research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf

It is sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001]. http://www.vldb.org/conf/2001/P069.pdf


Rate of change

There are a number of different studies analyzing the rate of change of pages in V.

[Cho & al., 2000]

  720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999.
  Any changes: 40% weekly, 23% daily.

[Fetterly & al., 2003]

  Massive study: 151M pages checked over a few months.
  Significantly changed: 7% weekly.
  Slightly changed: 25% weekly.

[Ntoulas & al., 2004]

  154 large sites re-crawled from scratch weekly.
  8% new pages per week.
  8% die.
  5% new content.
  25% new links per week.


Rate of change [Fetterly & al., 2003]


Rate of change [Ntoulas & al., 2004]


The Bow-Tie Structure


The Power of Power Laws

A power law relationship between two scalar quantities x and y is one where the relationship can be written as

$y = a x^{k}$

where a (the constant of proportionality) and k (the exponent of the power law) are constants.

Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics.

Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.
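Taking logarithms turns a power law into a straight line, log y = log a + k log x, so the exponent k is the slope on a log-log plot. The sketch below recovers a and k by ordinary least squares on the logged values. The function name `fit_power_law` and the sample data are assumptions made for illustration, and this kind of fit is only a rough diagnostic, not a rigorous estimator for empirical distributions.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k log x; returns (a, k).

    Hypothetical helper: assumes all xs and ys are strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    k = sxy / sxx                      # slope of the log-log line
    a = math.exp(mean_y - k * mean_x)  # intercept, mapped back from log space
    return a, k


# Synthetic data generated from y = 3 * x^(-2.1); the fit should recover both constants.
xs = [1, 2, 4, 8, 16]
ys = [3.0 * x ** -2.1 for x in xs]
a, k = fit_power_law(xs, ys)
```

Scale invariance is visible in the same form: replacing x by cx only multiplies y by the constant c^k, so the shape of the curve is the same at every scale.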


Power Law Probability Distributions

Sometimes called heavy-tail or long-tail distributions.

Examples of power law probability distributions:

  The Pareto distribution, for example, the distribution of wealth in capitalist economies.

  Zipf's law, for example, the frequency of unique words in large texts. http://wordcount.org/main.php

  Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web).

  Frequency of events or effects of varying size in self-organized critical systems, e.g. the Gutenberg-Richter law of earthquake magnitudes and Horton's laws describing river systems.


The in/out-degree

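Power law trend, where X is the in-degree or out-degree of a page:

$\Pr(X = k) \approx c\,k^{-\beta}$

Fitted exponents shown on the slide's two plots: $\beta \approx 2.1$ and $\beta \approx 2.55$.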
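The empirical counterpart of this probability is the fraction of pages with a given in-degree or out-degree. The sketch below computes it from an edge list, using the four-page toy graph of this lecture as input; `degree_distribution` is a hypothetical helper name, and a graph this small is of course far too tiny to show a power law.

```python
from collections import Counter


def degree_distribution(vertices, edges, kind="in"):
    """Return {k: fraction of vertices whose in- or out-degree equals k}."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        # An edge (u, v) adds one to the out-degree of u and the in-degree of v.
        deg[v if kind == "in" else u] += 1
    counts = Counter(deg.values())
    n = len(vertices)
    return {k: counts[k] / n for k in sorted(counts)}


V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]

in_dist = degree_distribution(V, E, kind="in")
out_dist = degree_distribution(V, E, kind="out")
```

On a real crawl, plotting these fractions against k on log-log axes is the usual first check for a power law trend.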


Random Graphs

Random graphs (RGs) are structures introduced by Paul Erdős and Alfréd Rényi.

There are several models of RGs. We are concerned with the model $G_{n,p}$.

A graph $G = (V,E) \in G_{n,p}$ has $|V| = n$, and each possible edge $(u,v)$ is included in $E$ independently at random with probability $p$.
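A minimal sketch of sampling from this model, assuming the undirected variant in which each of the n(n-1)/2 unordered pairs of vertices is tested once. The function name `sample_gnp` and the `seed` parameter are illustrative assumptions.

```python
import itertools
import random


def sample_gnp(n, p, seed=None):
    """Sample an undirected graph on vertices 0..n-1, keeping each pair with probability p."""
    rng = random.Random(seed)  # private generator, so results are reproducible per seed
    vertices = list(range(n))
    edges = [
        (u, v)
        for u, v in itertools.combinations(vertices, 2)
        if rng.random() < p
    ]
    return vertices, edges


V, E = sample_gnp(100, 0.05, seed=42)
```

The expected number of edges is p·n(n-1)/2, about 247.5 for the parameters above. The loop over all pairs costs quadratic time, which is fine for a sketch but not for graphs of web scale.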


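W cannot be a RG

Let $X_k$ be a discrete random variable counting the number of nodes having degree equal to k.

In $G_{n,p}$ the expected value of $X_k$ is

$$E(X_k) = n \binom{n-1}{k} p^{k} (1-p)^{n-1-k}.$$

$X_k$ is asymptotically distributed as a Poisson variable with mean $\lambda_k$, the limiting value of $E(X_k)$:

$$\Pr(X_k = r) \to e^{-\lambda_k} \frac{\lambda_k^{r}}{r!}.$$

Since $E(X_k)$ falls off at least exponentially fast once k is well above the mean degree $(n-1)p$, $G_{n,p}$ cannot reproduce the power law degree distribution observed in W.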
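To see numerically why a random graph cannot match a power law, one can compare the expected number of high-degree nodes under the two hypotheses. The sketch below assumes the standard binomial expression for G(n,p), E(X_k) = n·C(n-1,k)·p^k·(1-p)^(n-1-k), and a power law normalized over degrees 1 to n-1. The function names and the parameter values (n = 1000, p = 0.01, β = 2.1) are assumptions chosen for illustration.

```python
import math


def expected_nodes_with_degree(n, p, k):
    """E(X_k) in G(n,p) for 0 < p < 1, computed in log space to avoid overflow."""
    log_binom = math.lgamma(n) - math.lgamma(k + 1) - math.lgamma(n - k)
    log_pmf = log_binom + k * math.log(p) + (n - 1 - k) * math.log(1 - p)
    return n * math.exp(log_pmf)


def power_law_nodes_with_degree(n, beta, k):
    """Expected count if Pr(degree = k) = c * k^(-beta) for k = 1..n-1."""
    c = 1.0 / sum(j ** -beta for j in range(1, n))  # normalizing constant
    return n * c * k ** -beta


n, p, beta = 1000, 0.01, 2.1
gnp_tail = expected_nodes_with_degree(n, p, 100)    # mean degree is about 10
pl_tail = power_law_nodes_with_degree(n, beta, 100)
```

With these values the random graph expects essentially no node of degree 100, while the power law still expects a few hundredths of a node, a gap of many orders of magnitude.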


The avg distance of a graph G

Let $u, v \in V$ be two nodes of G. Let $d(u,v)$ be the distance from u to v, expressed as the length of the shortest path connecting u to v. If u and v are not connected, the distance is set to $\infty$.

Define

$$L(G) = \frac{1}{|S|} \sum_{\{u,v\} \in S} d(u,v)$$

where S is the set of pairs of distinct nodes u, v of G with the property that $d(u,v)$ is finite.
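The definition can be evaluated directly with one breadth-first search per node. The sketch below is a minimal version that averages over ordered pairs (u, v) with v reachable from u, which is an assumption made here for the directed case; on a symmetric graph it gives the same value as averaging over unordered pairs. The helper names are hypothetical, and the input is the toy graph of this lecture.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj):
    """Mean of d(u, v) over ordered pairs of distinct nodes at finite distance."""
    total, pairs = 0, 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


def undirected(adj):
    """Symmetrized copy of adj: every link can be followed in both directions."""
    sym = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            sym[v].add(u)
    return {u: sorted(vs) for u, vs in sym.items()}


toy = {1: [2, 4], 2: [3, 4], 3: [1], 4: [3]}
directed_avg = average_distance(toy)
undirected_avg = average_distance(undirected(toy))
```

Ignoring link direction can only shorten paths, so the undirected average is never larger than the directed one. Running one search per node costs O(|V|·(|V| + |E|)), so on graphs of web scale the average is normally estimated from a sample of source nodes.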


The avg distance of W

A small world graph is a graph whose average distance is much smaller than the order of the graph.

For instance, $L(G) \in O(\log |V(G)|)$.

$L(W)$ is about 7.

$L_d(W)$ is about 18.


What’s the best model for W?

A graph model for the web should have (at least) the following features:

1. On-line property. The number of nodes and edges changes with time.

2. Power law degree distribution. The degree distribution follows a power law, with an exponent β > 2.

3. Small world property. The average distance is much smaller than the order of the graph.

4. Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.

It is still an open problem to find a web graph model that produces graphs which provably have all four properties.


W Models proposed so far.

[Bollobas & al., 2001]. Linearized Chord Diagram (LCD).

[Aiello & al., 2001]. ACL.

[Chung & al., 2003]. CL.

[Kumar & al., 1999]. Copying model.

[Chung & al., 2004]. CL-del growth-deletion model.

[Cooper & al., 2004]. CFV.


General Characteristics

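Columns 1 to 4 refer to the four properties listed under "What's the best model for W?".

Model     Directed   1   2   3   4   β
LCD       Y          Y   Y   Y   ?   3
ACL       Y          Y   Y   ?   N   (2, ∞)
CL        N          N   Y   Y   ?   (2, ∞)
Copying   Y          Y   Y   ?   Y   (2, ∞)
CL-del    N          Y   Y   Y   ?   (2, ∞)
CFV       N          Y   Y   ?   ?   (2, ∞)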

References

[Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903.

[Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678.

[Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, and Andrew Tomkins. Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).

[Cho & al., 2000]. Junghoo Cho and Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209.

[Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, and Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12.

[Bollobas & al., 2001]. Béla Bollobás, Oliver Riordan, G. Tusnády, and Joel Spencer. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol. 18, 2001, 279-290.

[Aiello & al., 2001]. William Aiello, Fan R. K. Chung, and Linyuan Lu. Random Evolution in Massive Graphs. FOCS 2001: 510-519.

[Chung & al., 2003]. Fan R. K. Chung and L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics, 1 (2003): 91-114.

[Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41st FOCS, 2000, pp. 57-65.

[Chung & al., 2004]. F. Chung and L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics, vol. 1 (2003), 409-461.

[Cooper & al., 2004]. C. Cooper, A. Frieze, and J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics, vol. 1 (2003), 463-483.