tutorial 6 (web graph attributes)

Web Graph Characteristics Kira Radinsky

Post on 05-Dec-2014

DESCRIPTION

Part of the Search Engine course given in the Technion (2011)

TRANSCRIPT

Page 1: Tutorial 6 (web graph attributes)

Web Graph Characteristics

Kira Radinsky

Page 2: Tutorial 6 (web graph attributes)

The Web as a Graph

Pages as graph nodes, hyperlinks as edges.
– Sometimes sites are taken as the nodes

Some natural questions:

1. Distribution of the number of in-links to a page.

2. Distribution of the number of out-links from a page.

3. Distribution of the number of pages in a site.

4. Connectivity: is it possible to reach most pages from most pages?

5. Is there a theoretical model that fits the graph?

Page 3: Tutorial 6 (web graph attributes)

Mathematical Background: Power-Law Distributions

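• A non-negative random variable X is said to have a Power-Law distribution if, for some constants c>0 and α>0: $\Pr[X > x] \sim c\,x^{-\alpha}$, or equivalently, in terms of the density, $f(x) \sim x^{-(\alpha+1)}$

• Taking logs of both sides, we have: $\log \Pr[X > x] \approx -\alpha \log(x) + \log c$, i.e. a straight line with slope $-\alpha$ on a log-log plot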

• Power-Law distributions have “heavy/long tails”, i.e. the probability mass of events whose value is far from the mean or median of the distribution is significant

– Unlike Normal or Geometric/Exponential distributions, where the probability mass of the tail decreases at least exponentially, in Power-Law distributions the mass of the tail decreases only polynomially, as $x^{-\alpha}$

– Another point of view: in an Exponential distribution, f(x)/f(x+k) is constant (independent of x), whereas in a Power-Law distribution, f(x)/f(kx) is constant.

– The “average” quantity in a Power-Law distribution is not “typical”

• Examples of Power-Law distributions are the Pareto and Zipf distributions (see next slides)
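
The "another point of view" bullet can be checked numerically. The sketch below is only an illustration: the parameter values (`alpha`, `lam`, `k`) and the function names are assumptions, not part of the slides. It evaluates the scale ratio f(x)/f(kx) for a power-law density and the shift ratio f(x)/f(x+k) for an exponential density at several values of x.

```python
import math

def power_law_pdf(x, alpha):
    # Unnormalized power-law density, f(x) proportional to x^-(alpha+1).
    return x ** -(alpha + 1)

def exponential_pdf(x, lam):
    # Density of an Exponential(lam) random variable.
    return lam * math.exp(-lam * x)

# Illustrative values only (assumptions, not taken from the slides).
alpha, lam, k = 1.1, 0.5, 2.0

for x in (1.0, 10.0, 100.0):
    # Power law: rescaling x by k changes the density by a fixed factor.
    scale_ratio = power_law_pdf(x, alpha) / power_law_pdf(k * x, alpha)
    # Exponential: shifting x by k changes the density by a fixed factor.
    shift_ratio = exponential_pdf(x, lam) / exponential_pdf(x + k, lam)
    print(x, scale_ratio, shift_ratio)
```

Both printed ratios should stay the same across the rows: the first equals k^(α+1) and the second equals e^(λk), neither of which depends on x.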

Page 4: Tutorial 6 (web graph attributes)

Mathematical Background: The Pareto Distribution

• A continuous, positive random variable X in the range $[L, \infty)$ is said to be distributed Pareto(L,k) if its probability density function is: $f(x; k, L) = k L^k / x^{k+1}$

• This implies that $\Pr(X > x) = (L/x)^k$ for $x \ge L$

– Has a finite mean of $kL/(k-1)$ only for k>1
– Has finite variance only for k>2

• Named after the Italian economist Vilfredo Pareto (1848-1923), who used it to model the distribution of wealth in society
– Most people have little income; 20% of society holds 80% of the wealth
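
A minimal sketch of the Pareto formulas on this slide. The tail probability and the mean follow the slide directly. The `top_share` function is not from the slides: it uses the standard Lorenz-curve result for the Pareto distribution (the richest fraction p holds a share p^(1-1/k) of the total, for k>1) to show which shape parameter reproduces the 80/20 statement. All function names are illustrative.

```python
import math

def pareto_survival(x, L, k):
    # Prob(X > x) = (L/x)^k for x >= L; every value exceeds anything below L.
    return 1.0 if x < L else (L / x) ** k

def pareto_mean(L, k):
    # The mean k*L/(k-1) is finite only for k > 1.
    return k * L / (k - 1) if k > 1 else math.inf

def top_share(p, k):
    # Share of total wealth held by the richest fraction p (requires k > 1).
    # Standard Lorenz-curve result for Pareto; an addition to the slide text.
    return p ** (1.0 - 1.0 / k)

# Shape parameter for which the richest 20% hold 80% of the wealth.
k_8020 = math.log(5) / math.log(4)

print(pareto_survival(2.0, 1.0, 2.0))
print(pareto_mean(1.0, 2.0))
print(k_8020, top_share(0.2, k_8020))
```

The 80/20 split corresponds to a shape parameter only slightly above 1, i.e. a distribution with a finite mean but infinite variance.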

Page 5: Tutorial 6 (web graph attributes)

Mathematical Background: Zipf’s Law

• A random variable X follows Zipf’s Law (is “Zipfian”) with parameter α when the j’th most popular value of X occurs with probability proportional to $j^{-\alpha}$

– Essentially the distribution is over the discrete ranks

• Whenever α>1, X may take an infinite number of values (i.e. have infinitely many different value popularities), since $\sum_j j^{-\alpha}$ then converges and the probabilities can be normalized

• Named after the American linguist George Kingsley Zipf (1902-1950), who observed it in the frequencies of words in the English language
– In a large corpus of English text, the 135 most frequently occurring words accounted for half of the text
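
To make the rank-based definition concrete, here is a small sketch that builds a Zipfian distribution over a finite number of ranks and counts how many top ranks are needed to cover a given share of the probability mass. The vocabulary size and α used in the example call are arbitrary assumptions; they are not fitted to any corpus and are not meant to reproduce the 135-word figure.

```python
def zipf_probabilities(n_ranks, alpha):
    # Probability of the j'th most popular value is proportional to j^-alpha.
    weights = [j ** -alpha for j in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ranks_needed(probs, mass):
    # Smallest number of top ranks whose probabilities sum to at least `mass`.
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= mass:
            return count
    return len(probs)

# Hypothetical setting: 1000 distinct values, alpha = 1.0.
probs = zipf_probabilities(1000, 1.0)
print(probs[0] / probs[1])        # ratio between rank 1 and rank 2
print(ranks_needed(probs, 0.5))   # ranks that cover half of the mass
```

Even in this toy setting a small fraction of the ranks covers half of the mass, which is the same qualitative effect Zipf observed for word frequencies.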

Page 6: Tutorial 6 (web graph attributes)

Mathematical Background: An Observed Zipfian Sample Implies a Power-Law

The following analysis is due to Lada Adamic:

• Assume that N units of wealth (coins) are distributed to M individuals
– There are N observations of a random variable Y that can take on the discrete values 1,2,…,M
  • $Y_k = j$ (k=1,…,N, j=1,…,M) means that person j got coin k

– Denote by $X_1$ [$X_M$] the number of coins of the richest [poorest] individual at the end of the process
  • For simplicity, assume that N>>M and the $X_j$’s are all distinct

• Assume that a perfect Zipfian behavior is observed, i.e. $X_r/N \sim r^{-b}$ for all r=1,…,M
– This trivially implies $X_r \sim r^{-b}$

Page 7: Tutorial 6 (web graph attributes)

Mathematical Background: An Observed Zipfian Sample Implies a Power-Law (cont.)

• Recap: we distributed N coins to M individuals, and denoted by $X_1$ [$X_M$] the number of coins of the richest [poorest] individual at the end of the process

• By assuming Zipfian wealth: $X_r \sim r^{-b}$, or $X_r = c\,r^{-b}$

• Let Z be the random variable of a person’s wealth, i.e. the number of coins a person gets by this process

• Observation: if the r’th richest person got $X_r$ coins, then exactly r people out of M got $X_r$ coins or more

• $\Pr[Z \ge X_r] = \Pr[Z \ge c\,r^{-b}] = r/M$

• Define $y = c\,r^{-b}$, and so $r = (y/c)^{-1/b}$, and so $\Pr[Z \ge y] = y^{-1/b}\,c^{1/b}/M$

• Hence $\Pr[Z \ge y] \sim y^{-1/b}$, and Z obeys a Power-Law
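
The derivation above can be sanity-checked numerically. The sketch below assumes a perfectly Zipfian wealth profile with made-up values for M, c and b, builds the tail probabilities r/M from the ranks, and fits a straight line on log-log axes. The helper `fit_slope` is a plain least-squares slope written for this illustration.

```python
import math

def fit_slope(xs, ys):
    # Ordinary least-squares slope of ys against xs.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical parameters (assumptions, not from the slides).
M, c, b = 1000, 1e6, 0.8

# Perfect Zipfian wealth: the r'th richest person holds c * r^-b coins.
wealth = [c * r ** -b for r in range(1, M + 1)]
# Exactly r of the M people hold at least as much as the r'th richest.
tail_prob = [r / M for r in range(1, M + 1)]

slope = fit_slope([math.log(y) for y in wealth],
                  [math.log(p) for p in tail_prob])
print(slope, -1 / b)
```

The fitted slope should agree with -1/b, matching the exponent of the power law derived on the slide.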

Page 8: Tutorial 6 (web graph attributes)

Distribution of Inlinks

* Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.

A plot of the number of nodes having each value of in-degree

Both axes are in log-scale

Denoting the size of the sample crawl by N (over 200M here), we have:

$\log(N \cdot \Pr[\text{node has in-degree } x]) \approx -a \log(x) + c$

$\log(\Pr[\text{node has in-degree } x]) \approx -a \log(x) + c'$, where $c' = c - \log N$

Which indicates the Power-Law $\Pr[\text{node has in-degree } x] \sim x^{-a}$

Note that the number of nodes with small in-degree is over-estimated while the number of nodes with very high in-degree is under-estimated
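
A rough way to read an exponent off such a plot is to fit a straight line to the log-log points. The sketch below does this with plain least squares on synthetic counts generated from an exact power law; the counts, the constant 1e8 and the function name are assumptions for illustration, not data from the Broder et al. crawl. On real data this simple fit is only a first approximation, since sparse, noisy counts at high degrees can bias it.

```python
import math

def estimate_exponent(degree_counts):
    """Fit log(count) = -a*log(x) + c by least squares; return (a, c)."""
    # Degrees or counts of zero cannot be placed on log axes, so skip them.
    points = [(math.log(x), math.log(n))
              for x, n in degree_counts.items() if x > 0 and n > 0]
    k = len(points)
    mean_x = sum(px for px, _ in points) / k
    mean_y = sum(py for _, py in points) / k
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept

# Synthetic, noise-free counts following count(x) = 1e8 * x^-2.1.
synthetic = {x: 1e8 * x ** -2.1 for x in range(1, 1001)}
a, c = estimate_exponent(synthetic)
print(a, c)
```

On this noise-free input the fit recovers the exponent used to generate the counts, and the intercept recovers the log of the scaling constant.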

Page 9: Tutorial 6 (web graph attributes)

More Power-Laws on the Web

We’ve seen that the in-degree of pages exhibits a Power-Law. Furthermore:

• Out-degree (somewhat surprising)

• Degrees of the inter-host graph

• Number of pages in Web sites

• Number of visits to Web sites/pages

• PageRank scores
– With an exponent very close to that of the in-degree distribution

– Curiously, degrees in the telephone call graph have the same 2.1 exponent

• Frequencies of words (as observed by Zipf)

• Popularities of queries submitted to search engines (will be discussed later in the course)

Page 10: Tutorial 6 (web graph attributes)

The Web as a Graph

Connectivity: is it possible to reach most pages from most pages?

The Web is a bow-tie!

The Web graph is also scale-free, fractal: many slices and subgraphs exhibit similar properties.

Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
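
The bow-tie picture can be reproduced on a toy graph. The sketch below is a minimal illustration, not the procedure used in the paper: it picks one node assumed to lie in the central strongly connected component, then uses forward and backward breadth-first search to split the nodes into SCC, IN and OUT. The component names follow Broder et al.; the `OTHER` bucket lumps together everything else (tendrils, tubes and disconnected pieces), and the edge list is made up.

```python
from collections import deque

def reachable(adjacency, start):
    # Breadth-first search: all nodes reachable from `start`, including itself.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, core_node):
    # Build forward and reversed adjacency lists from (source, target) pairs.
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
        nodes.update((src, dst))
    descendants = reachable(forward, core_node)   # pages the core can reach
    ancestors = reachable(backward, core_node)    # pages that can reach the core
    scc = descendants & ancestors
    return {
        "SCC": scc,
        "IN": ancestors - scc,
        "OUT": descendants - scc,
        "OTHER": nodes - ancestors - descendants,
    }

# Toy graph: a -> {b <-> c} -> d, plus an unrelated edge e -> f.
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d"), ("e", "f")]
parts = bow_tie(edges, "b")
print({name: sorted(members) for name, members in parts.items()})
```

The result depends on the choice of `core_node`: on a real crawl one would first locate the largest strongly connected component and pick a node inside it.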

Page 11: Tutorial 6 (web graph attributes)

Self-Similarity on the Web
Dill et al., ACM TOIT 2002

• Created large Thematically Unified Clusters (TUCs)
– Pages containing a certain keyword
– Pages of large Web sites/Intranets
– Pages containing a geographical reference in the Western US
– The host graph

• In general, the TUCs display very similar graph properties, e.g.
– In/out degree distributions
– Bow-tie structure (relative sizes of the components)

• Also discovered that the SCCs of the different TUCs are strongly connected to one another, i.e. it is possible to browse between the TUCs