Hierarchical Link Analysis for Ranking Web Data
Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker
Digital Enterprise Research Institute, Galway
June 1, 2010
Introduction Web Data Model The DING Model Experimental Results Scalability Conclusion
Introduction
Web of Data
The number of web data sources is growing rapidly:
Linked Open Data cloud;
Open Graph protocol;
e-commerce (GoodRelations), e-government, ...
How to search and retrieve relevant information?
A single query can return millions of entities ...
... and users expect only the most relevant ones.
Web data search engines (e.g., Sindice) need an effective way to rank entities.
Partial solution: popularity-based entity ranking.
Link Analysis on the Web
Link Analysis
Given a directed graph, determine the popularity of its nodes using link information.
A link from a node i to a node j is considered as evidence of the importance of node j.
Link Analysis for Web Documents
PageRank considers link structure exclusively.
Hierarchical Link Analysis considers both link structure and hierarchical structure.
Link Analysis for Web Data
Current approaches consider link structure exclusively.
Sindice: dataset/entity-centric view.
Outline: Web Data Model
Web Data Model
  Web Data Graph
  Dataset Graph
  Internal and External Node
  Intra and Inter-Dataset Edge
  Linkset
  Two-Layer Model
  Quantifying the Two-Layer Model
Web Data Graph
Figure: Web data graph
Dataset Graph
Figure: Dataset graph
Internal and External Node
Figure: Internal (red) and external nodes (blue)
Intra and Inter-Dataset Edge
Figure: Inter-dataset (orange) and intra-dataset (black) edges
Linkset
Figure: Linkset
Two-Layer Model
Figure: Two-layer model of the Web of Data
Quantifying the two-layer model
Datasets

Dataset                  Entities
DBpedia                  17.7 million
Citeseer (RKBExplorer)   2.48 million
Geonames                 13.8 million
Sindice                  60 million (across 50,000 datasets)

Dataset    Intra-dataset links   Inter-dataset links
DBpedia    88M (93.2%)           6.4M (6.8%)
Citeseer   12.9M (77.7%)         3.7M (22.3%)
Geonames   59M (98.3%)           1M (1.7%)
Sindice    287M (78.8%)          77M (21.2%)

Table: Ratio of intra-dataset to inter-dataset links
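The intra/inter split in the table can be illustrated with a minimal sketch. It assumes each edge is a (source, target) pair and that a lookup from entity to dataset is available; the names `link_ratio`, `intra_share`, `dataset_of` and the toy data are hypothetical and not taken from the slides.

```python
from collections import Counter


def link_ratio(edges, dataset_of):
    """Count intra- and inter-dataset edges, grouped by the source dataset."""
    counts = {}
    for source, target in edges:
        src_dataset = dataset_of[source]
        # An edge is intra-dataset when both endpoints belong to the same dataset.
        kind = "intra" if dataset_of[target] == src_dataset else "inter"
        counts.setdefault(src_dataset, Counter())[kind] += 1
    return counts


def intra_share(counter):
    """Fraction of a dataset's outgoing edges that stay inside the dataset."""
    total = counter["intra"] + counter["inter"]
    return counter["intra"] / total if total else 0.0


# Hypothetical toy graph: three entities in dataset A, one in dataset B.
dataset_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a1"), ("a1", "b1"), ("b1", "a1")]

counts = link_ratio(edges, dataset_of)
for name, counter in sorted(counts.items()):
    print(name, counter["intra"], counter["inter"], f"{intra_share(counter):.1%}")
```

Attributing each edge to its source dataset is one possible convention; the slides do not state which convention was used to produce the table.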
Outline: The DING Model
The DING Model
  Overview
  Unsupervised Link Weighting
  Computing DatasetRank
  Computing Local EntityRank
  Combining Dataset Rank and Entity Rank
The DING Model: Overview
DING Principles
DING performs entity ranking in three steps:
1. dataset ranks are computed by performing link analysis on the top layer (i.e. the dataset graph);
2. for each dataset, entity ranks are computed by performing link analysis on the local entity collection;
3. the popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.
Unsupervised Link Weighting
Intuition
TF-IDF applied on link labels
Link Frequency - Inverse Dataset Frequency (LF-IDF)
Link weighting factor $w_{\sigma,i,j}$
Assign low weight to very common links, such as rdfs:seeAlso
$$w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma) = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + freq(\sigma)}$$
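A minimal sketch of the LF-IDF weighting, under stated assumptions: a linkset is keyed by (label, source dataset, target dataset) and maps to its number of links; N is taken to be the number of datasets; and freq(σ) is taken to be the number of datasets in which label σ occurs as an outgoing link label. The slide does not define N or freq, so both readings are assumptions, as are the function name `lf_idf_weights` and the toy data.

```python
import math
from collections import defaultdict


def lf_idf_weights(linksets, num_datasets):
    """Return an LF-IDF weight for every (label, source, target) linkset."""
    out_total = defaultdict(int)      # total links leaving each source dataset
    label_sources = defaultdict(set)  # datasets in which each label occurs
    for (label, source, _target), size in linksets.items():
        out_total[source] += size
        label_sources[label].add(source)

    weights = {}
    for (label, source, target), size in linksets.items():
        lf = size / out_total[source]
        # Assumption: freq(label) = number of datasets using the label.
        idf = math.log(num_datasets / (1 + len(label_sources[label])))
        weights[(label, source, target)] = lf * idf
    return weights


# Hypothetical toy collection of four datasets A, B, C, D.
linksets = {
    ("rdfs:seeAlso", "A", "B"): 50,
    ("foaf:knows", "A", "C"): 50,
    ("rdfs:seeAlso", "B", "C"): 10,
    ("rdfs:seeAlso", "C", "A"): 10,
}
weights = lf_idf_weights(linksets, num_datasets=4)
for key, value in sorted(weights.items()):
    print(key, round(value, 4))
```

In this toy collection rdfs:seeAlso occurs in three of the four datasets, so its IDF is log(4/4) = 0 and its linksets carry no weight, while the rarer foaf:knows linkset keeps a positive weight. Note that under this reading the IDF term becomes negative when a label occurs in every dataset; how that case is handled is not stated on the slide.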
Computing Dataset Rank
Assumption
Dataset surfing behaviour is the same as the web page surfing behaviour in PageRank
DatasetRank
Weighted PageRank on the weighted dataset graph
Distribution factor $w_{\sigma,i,j}$ is defined by LF-IDF
Probability of a random jump is proportional to the size of a dataset
$$r^{k}(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i)\, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \in G} |E_D|}$$
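The DatasetRank update can be sketched as a plain power iteration. This is an illustration under assumptions, not the authors' implementation: the damping value alpha = 0.85 and the fixed iteration count are assumed (the slide gives neither), the weights are taken as given without further normalisation, and the names `dataset_rank`, `weights` and `sizes` are hypothetical.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=100):
    """Iterate r_k(D_j) = alpha * sum(r_{k-1}(D_i) * w) + (1 - alpha) * |E_Dj| / total.

    weights: {(label, source_dataset, target_dataset): w}
    sizes:   {dataset: number of entities}
    """
    total = sum(sizes.values())
    # Random-jump probability proportional to dataset size.
    jump = {dataset: n / total for dataset, n in sizes.items()}
    rank = dict(jump)  # start from the jump distribution
    for _ in range(iterations):
        incoming = {dataset: 0.0 for dataset in sizes}
        for (_label, source, target), w in weights.items():
            incoming[target] += rank[source] * w
        rank = {
            dataset: alpha * incoming[dataset] + (1 - alpha) * jump[dataset]
            for dataset in sizes
        }
    return rank


# Hypothetical two-dataset graph with one linkset in each direction.
weights = {("p", "A", "B"): 1.0, ("p", "B", "A"): 1.0}
sizes = {"A": 3, "B": 1}
rank = dataset_rank(weights, sizes)
print(rank)
```

With symmetric links, the larger dataset A ends up with the higher rank purely because of the size-proportional random jump. The ranks sum to 1 here only because each dataset's outgoing weights sum to 1; raw LF-IDF weights do not guarantee that.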
Computing Local EntityRank
Generic Algorithms
Weighted EntityRank: weighted PageRank applied to the internal entities and intra-dataset links of a dataset
Weighted LinkCount: weighted in-degree (count of incoming links) computed over the internal entities and intra-dataset links of a dataset
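A minimal sketch of Weighted LinkCount, assuming intra-dataset links are given as (source, label, target) triples and that each link contributes a per-label weight to its target. The per-label weighting scheme, the normalisation to a sum of 1, and the names `weighted_link_count` and `label_weight` are assumptions made for illustration; the slide does not specify them.

```python
from collections import defaultdict


def weighted_link_count(intra_links, label_weight=None):
    """Score each internal entity by the weighted number of incoming intra-dataset links."""
    label_weight = label_weight or {}
    score = defaultdict(float)
    for source, label, target in intra_links:
        score[source] += 0.0  # make sure entities without in-links still appear
        score[target] += label_weight.get(label, 1.0)  # unknown labels count as 1
    total = sum(score.values())
    # Assumption: normalise so the local ranks of a dataset sum to 1.
    return {entity: s / total for entity, s in score.items()} if total else dict(score)


# Hypothetical links inside a single dataset.
links = [("a", "p", "b"), ("c", "p", "b"), ("b", "q", "a")]
scores = weighted_link_count(links, label_weight={"p": 1.0, "q": 2.0})
print(scores)
```

A single pass over the links is enough, which is consistent with LinkCount needing only one iteration.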
Combining Dataset Rank and Entity Rank
Naive approach
Purely probabilistic point of view: joint probability
Assumption: independent events
Global score $r_g(e) = P(e \cap D) = r(e) \cdot r(D)$
Problem: favours smaller datasets
DING Approach
Add a local entity rank factor;
Normalise local ranks to the same average based on dataset size
$$r_g(e) = r(D) \cdot r(e) \cdot \frac{|E_D|}{\sum_{D' \in G} |E_{D'}|}$$
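A short worked example of why the naive product favours smaller datasets and how the size factor compensates. It assumes local ranks within a dataset sum to 1, so the average local rank in a dataset of n entities is 1/n; the function names and the toy numbers are hypothetical.

```python
import math


def naive_global_rank(dataset_rank, local_rank):
    """Joint-probability view: r_g(e) = r(e) * r(D)."""
    return dataset_rank * local_rank


def ding_global_rank(dataset_rank, local_rank, dataset_size, total_entities):
    """DING combination: r_g(e) = r(D) * r(e) * |E_D| / sum over D' of |E_D'|."""
    return dataset_rank * local_rank * dataset_size / total_entities


# Two hypothetical datasets with the same dataset rank but different sizes.
small_size, large_size = 10, 1000
total = small_size + large_size
dataset_score = 0.5

# An average entity in each dataset has local rank 1 / size.
naive_small = naive_global_rank(dataset_score, 1 / small_size)
naive_large = naive_global_rank(dataset_score, 1 / large_size)
ding_small = ding_global_rank(dataset_score, 1 / small_size, small_size, total)
ding_large = ding_global_rank(dataset_score, 1 / large_size, large_size, total)

print(naive_small, naive_large)  # the small dataset's entity scores 100 times higher
print(ding_small, ding_large)    # both average entities now score the same
```

Multiplying by the relative dataset size cancels the 1/n dilution, so entities from equally ranked datasets are compared on the same scale regardless of dataset size.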
Outline: Experimental Results
Experimental Results
  Overview
  User Study
  SemSearch 2010
Experimental Results: Overview
Link Analysis Methods
Global EntityRank (GER);
Local LinkCount (LLC) and Local EntityRank (LER);
Local algorithms combined with DatasetRank (DR-LLC and DR-LER).
Experiments
1. User study to qualitatively evaluate each method;
2. Semantic Search challenge.
User Study: Design
Exp-A
Local entity ranking (LER & LLC) on the DBpedia dataset
31 participants
Exp-B
DING (DR-LER & DR-LLC) on Sindice's page repository
58 participants
Task
10 queries (keyword and SPARQL queries)
One result list (top-10) per algorithm
Rate each algorithm relative to GER: Worse (W), Slightly Worse (SW), Similar (S), Slightly Better (SB), Better (B)
User Study: Questionnaire
Figure: One of the questionnaires given to the participants
User Study A: Results
(a) LER

Rate     O_i   E_i    %χ2
B        0     6.2    −13%
SB       7     6.2    +0%
S        21    6.2    +71%
SW       3     6.2    −3%
W        0     6.2    −13%
Totals   31    31

(b) LLC

Rate     O_i   E_i    %χ2
B        3     6.2    −12%
SB       8     6.2    +4%
S        13    6.2    +53%
SW       6     6.2    −0%
W        1     6.2    −31%
Totals   31    31

Table: Chi-square test for Exp-A. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
Conclusion
LER and LLC provide results similar to GER. However, a larger proportion of participants considers LER similar to GER (21 of 31, versus 13 of 31 for LLC).
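The %χ2 column of the Exp-A table can be reproduced with a short sketch. It assumes a uniform expected count across the five ratings (31 / 5 = 6.2, matching the table) and that the sign indicates whether the observed count lies above or below the expected one; the function name `chi_square_contributions` is hypothetical.

```python
def chi_square_contributions(observed):
    """Return chi-square and each category's signed relative contribution in percent."""
    expected = sum(observed) / len(observed)  # uniform null hypothesis
    terms = [(o - expected) ** 2 / expected for o in observed]
    chi2 = sum(terms)
    signed = [
        (100 * term / chi2) * (1 if o >= expected else -1)
        for o, term in zip(observed, terms)
    ]
    return chi2, signed


# Observed counts for LER in the order B, SB, S, SW, W, from table (a).
ler_observed = [0, 7, 21, 3, 0]
chi2, signed = chi_square_contributions(ler_observed)
print(round(chi2, 2), [round(s) for s in signed])
```

Under these assumptions the rounded contributions agree with the LER column, with the "Similar" rating accounting for about 71% of the statistic.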
User Study B: Results
(a) DR-LER

Rate     O_i   E_i     %χ2
B        12    11.6    +0%
SB       12    11.6    +0%
S        22    11.6    +57%
SW       9     11.6    −4%
W        3     11.6    −39%
Totals   58    58

(b) DR-LLC

Rate     O_i   E_i     %χ2
B        7     11.6    −9%
SB       24    11.6    +65%
S        13    11.6    +1%
SW       10    11.6    −1%
W        4     11.6    −24%
Totals   58    58

Table: Chi-square test for Exp-B. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
Conclusion
DR-LLC appears to be more effective. A large proportion of participants (24 of 58) finds it slightly better than GER, and only a few (4 of 58) find it worse.
SemSearch 2010: Entity Search Track
SemSearch 2010
First semantic search evaluation;
Focus on entity search.
Experiment Design
Billion Triple Challenge 2009 dataset;
92 keyword queries;
Relevance judgement on top 10 entities.
SemSearch 2010: Experiment Results
Figure: SemSearch 2010 evaluation results
Scalability: Computing Dataset Rank
Graph      Nodes   Edges
Web Data   60M     364M
Dataset    50K     1.2M

Table: Graph Size
DatasetRank
1 iteration ≈ 200 ms;
Good-quality ranks in a few seconds.
Scalability: Dataset size distribution
Power-law distribution;
The majority of the datasets contain fewer than 1,000 nodes.
Scalability: Computing Entity Rank
EntityRank
55 iterations of 1 minute each (for the DBpedia dataset).
LinkCount
requires only 1 iteration;
can be computed on the fly with an appropriate data index.
Dataset-Dependent Local EntityRank
Dataset Specific Algorithms
No reason to have one generic algorithm for all datasets;
We could choose an appropriate entity ranking algorithm for each dataset.

Graph Structure       Dataset                Algorithm
Generic, Controlled   DBpedia                LinkCount
Generic, Open         Social Communities     EntityRank
Hierarchical          Geonames, Taxonomies   DHC
Bipartite             DBLP                   CiteRank

Table: List of various graph structures with appropriate algorithms
Conclusion
DING Method
Hierarchical Link Analysis for web data;
Quality comparable to, or even better than, standard approaches;
Lower computational complexity;
Dataset-dependent local entity ranking.
Future Work
Investigate how to detect the appropriate local entity ranking method for a dataset;
Study query-dependent ranking and how it can be combined with DING ranking.