Outline Motivation Results Conclusions
Link Analysis in National Web Domains
Ricardo Baeza-Yates and Carlos Castillo
ICREA / Cátedra Telefónica, Universitat Pompeu Fabra
http://www.upf.edu/dtecn/
OSWIR 2005, Compiègne, France, September 19, 2005
Ricardo Baeza-Yates and Carlos Castillo Universitat Pompeu Fabra - Barcelona, Spain
Link Analysis in National Web Domains http://www.upf.edu/dtecn/
1 Motivation
2 Results
3 Conclusions
Motivation
Sampling the Web
X We don’t have access to a global-scale collection
X A set of Web sites in the same organization is not diverse enough
X A set of Web sites in the same topic might not be representative
X A set of random Web sites might not be connected
V A national domain has a good balance between diversity and completeness
Collections used
V Different economic, historical, linguistic and geographical contexts
Collection     Year   Available hosts [mill]   (rank)   Pages [mill]
Brazil         2005   3.9                      11th     4.7
Chile          2004   0.3                      42nd     3.3
Greece         2004   0.3                      40th     3.7
Indochina      2004   0.5                      38th     7.4
Italy          2004   9.3                      4th      41.3
South Korea    2004   0.2                      47th     8.9
Spain          2004   1.3                      25th     16.2
U.K.           2002   4.4                      10th     18.5
Scale-free topology
If we sort pages by the number of in-links, the $k$-th page has in-degree proportional to $k^{-\alpha}$ (Zipf's law).
Equivalently, the fraction of pages with $x$ in-links is proportional to $x^{-\theta}$ (power law). Experimentally, $\theta \approx 2.1$ on the Web.
Partial explanation: a multiplicative process; if $d_t$ is the number of links at time $t$, then $d_{t+1} = C \times d_t$.
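The link between the rank-based and the frequency-based formulations on this slide can be made explicit. The following is a sketch of the standard argument, assuming both proportionalities hold exactly and treating rank and in-degree as continuous quantities; the symbols $x_k$ (in-degree of the page at rank $k$) and $N$ (total number of pages) are introduced here only for illustration.

```latex
\begin{align*}
x_k &\propto k^{-\alpha}
  && \text{(Zipf's law for the page at rank } k\text{)} \\
k &\propto x_k^{-1/\alpha}
  && \text{(about } k \text{ pages have in-degree} \ge x_k\text{)} \\
\Pr[\text{in-degree} \ge x] &\propto x^{-1/\alpha}
  && \text{(divide the count by } N\text{)} \\
\Pr[\text{in-degree} = x] &\propto x^{-(1 + 1/\alpha)}
  && \text{(differentiate with respect to } x\text{)} \\
\theta &= 1 + \frac{1}{\alpha}
\end{align*}
```

Under these assumptions, $\theta \approx 2.1$ corresponds to $\alpha = 1/(\theta - 1) \approx 0.91$.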
In-degree
[Figure: in-degree distributions on log-log axes (horizontal ticks 10^0 to 10^4, vertical ticks 10^-7 to 10^-1), one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and the U.K.]
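The quantity plotted in such panels is presumably the fraction of pages having each in-degree. A minimal sketch of how that distribution could be computed from a list of page-to-page links follows; the function name `indegree_distribution`, the integer page ids and the toy edge list are assumptions made for illustration, and duplicate links are assumed to have been removed beforehand.

```python
from collections import Counter


def indegree_distribution(edges, num_pages):
    """Return {x: fraction of pages with exactly x in-links}."""
    # Count how many links point to each page.
    indegree = Counter(dst for _, dst in edges)
    # Count how many pages share each in-degree value.
    histogram = Counter(indegree.values())
    # Pages that never appear as a link target have in-degree 0.
    histogram[0] = num_pages - len(indegree)
    return {
        x: count / num_pages
        for x, count in sorted(histogram.items())
        if count > 0
    }


# Toy graph with 5 pages (ids 0..4); each pair is (source, destination).
edges = [(0, 1), (2, 1), (3, 1), (1, 0), (4, 0), (0, 2)]
dist = indegree_distribution(edges, num_pages=5)
print(dist)
```

Plotting `dist` with both axes on a logarithmic scale would give a curve comparable in form to one panel of the figure; the out-degree distribution can be obtained the same way by counting sources instead of destinations.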
Out-degree
[Figure: out-degree distributions on log-log axes (horizontal ticks 10^0 to 10^3, vertical ticks 10^-6 to 10^-1), one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and the U.K.]
Link scores (PageRank, Hubs, Authorities)
[Figure: distributions of link scores on log-log axes (horizontal ticks 10^-7 to 10^-4), three rows of panels, presumably one row per score, each row with one panel for Brazil, Chile, Greece and Korea]
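The slides do not say how the three link scores were computed. As a reminder of what they measure, here is a minimal sketch of the textbook iterations for PageRank and for HITS hub and authority scores. The damping factor 0.85, the fixed iteration count, the function names and the toy graph are assumptions, and every link target is assumed to appear as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for PageRank; out_links maps page -> list of pages."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for p in pages:
            for q in out_links[p]:
                new_rank[q] += damping * rank[p] / len(out_links[p])
        rank = new_rank
    return rank


def hits(out_links, iterations=50):
    """Iterative hub and authority scores, each normalized to sum to 1."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in out_links[p]:
                auth[q] += hub[p]
        total = sum(auth.values()) or 1.0
        auth = {p: a / total for p, a in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        total = sum(hub.values()) or 1.0
        hub = {p: h / total for p, h in hub.items()}
    return hub, auth


# Toy graph: a -> b, a -> c, b -> c, c -> a, d -> c.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(graph)
hub, auth = hits(graph)
print(max(pr, key=pr.get), max(hub, key=hub.get), max(auth, key=auth.get))
```

In the toy graph, page `c` receives the most links and ends up with the highest PageRank and authority score, while `a`, which links to both `b` and `c`, is the strongest hub.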
Power-law exponents
Collection     In-degree   Out-degree       PageRank   HITS
                           Small   Large               Hubs   Auth.
Brazil         1.9         0.7     2.7      1.8        2.9    1.8
Chile          2.0         0.7     2.6      1.9        2.7    1.9
Greece         1.9         0.6     1.9      1.8        2.6    1.8
Indochina      1.6         0.7     2.6
Italy          1.8         0.7     2.5
South Korea    1.9         0.3     2.0      1.8        3.7    1.8
Spain          2.1         0.9     4.2      2.0
U.K.           1.8         0.7     3.4

Values reported in the literature:
(Broder... 2000): in-degree 2.1, out-degree 2.7
(Dill... 2002): in-degree 2.1, out-degree 2.2
(Pandurangan... 2002): PageRank 2.1
(Kleinberg... 1999): in-degree ≈ 2
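The slides do not state how these exponents were fitted. One common option is the maximum-likelihood estimator for a continuous power law, sketched below on synthetic data. The function names, the choice of `x_min` and the sample size are assumptions for illustration; real degree data are discrete, so a discrete correction or a different fitting method may be more appropriate.

```python
import math
import random


def fit_power_law_exponent(values, x_min):
    """Maximum-likelihood estimate of theta for p(x) ~ x**(-theta), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def sample_power_law(theta, x_min, n, rng):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    exponent = -1.0 / (theta - 1.0)
    # 1 - rng.random() lies in (0, 1], so the power is always defined.
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = sample_power_law(theta=2.1, x_min=1.0, n=100_000, rng=rng)
estimate = fit_power_law_exponent(samples, x_min=1.0)
print(round(estimate, 2))
```

With this many samples the estimate should land close to the value 2.1 used to generate the data. On real collections the result depends strongly on where the tail is assumed to start, which may be why the out-degree column reports separate exponents for small and large values.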
Hostgraph
[Figure: hostgraph example in which the pages of www.example1.com, www.example2.com and www.example3.com are grouped into nodes S1, S2 and S3]
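In a hostgraph, each node is a host and there is a directed edge from one host to another when at least one page on the first links to a page on the second. A minimal sketch of that collapsing step follows; the function name `build_hostgraph` and the sample URLs under the example hosts are assumptions, and whether the authors weighted edges or handled aliases such as `www.` prefixes is not stated in the slides.

```python
from urllib.parse import urlsplit


def build_hostgraph(page_links):
    """Collapse page-level links into a set of (source_host, target_host) edges."""
    host_edges = set()
    for src_url, dst_url in page_links:
        src_host = urlsplit(src_url).hostname
        dst_host = urlsplit(dst_url).hostname
        # Links between pages of the same host do not create hostgraph edges.
        if src_host and dst_host and src_host != dst_host:
            host_edges.add((src_host, dst_host))
    return host_edges


page_links = [
    ("http://www.example1.com/a.html", "http://www.example1.com/b.html"),
    ("http://www.example1.com/a.html", "http://www.example2.com/"),
    ("http://www.example1.com/b.html", "http://www.example2.com/x.html"),
    ("http://www.example2.com/x.html", "http://www.example3.com/"),
]
hostgraph = build_hostgraph(page_links)
print(sorted(hostgraph))
```

The four page-level links collapse into two host-level edges, because the internal link is dropped and the two links from www.example1.com to www.example2.com count once.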
Hostgraph also exhibits a power-law
                      Hostgraph degree
Collection            In        Out
Brazil                1.9       1.9
Chile                 2.0       1.7
Greece                2.0       1.6
South Korea           1.2       1.4
Spain                 1.8       1.3
(Bharat... 2001)      1.6-1.7   1.7-1.8
(Dill... 2002)        2.3
Web structure: connected components
“Normal” vs “Giant” strongly connected components
[Figure: strongly connected components on log-log axes (horizontal ticks 10^0 to 10^5, vertical ticks 10^-6 to 10^0), one panel each for Brazil, Chile, Greece, Korea and Spain]
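To separate a giant strongly connected component from the normal ones, the components first have to be found. The sketch below uses Kosaraju's two-pass algorithm in an iterative form; it is one standard choice, not necessarily the method used for these collections, and the function name and toy edge list are assumptions.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Kosaraju's algorithm (iterative); returns a list of sets of nodes."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    visited, order = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(forward[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Toy graph: a 3-cycle, a 2-cycle reachable from it, and a lone source node.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]
components = strongly_connected_components(edges)
giant = max(components, key=len)
print(sorted(len(c) for c in components), sorted(giant))
```

Counting how many components have each size gives a size distribution, and `giant` picks out the largest component.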
Conclusions
V Consistent results across collections
V Differences in the amount of spam
V Comparison of other aspects [to be available soon]
Thank you