Database Group@CSE
k-Nearest Neighbors in Uncertain Graphs
Lin Yincheng, 2011-02-28
VLDB10
Outline
• Background
• Motivation
• Problem Definition
• Query Answering Approach
• Experimental Results
Background
k-Nearest Neighbors Uncertain Graphs
[Figure: example graph with edge weights 15, 15, 5, 5, 5]
Find the 2-nearest neighbors of vertex B
Motivation
Distance   Path            Probability
5          B-D             0.3
20         B-A-D, B-C-D    0.25648
∞          No path         0.44352
• Define meaningful distance functions that are more useful for identifying true neighbors.
• Introduce a novel pruning algorithm to process k-NN queries in uncertain graphs.
[Figure: the example graph with edges labeled weight(probability): 15(0.2), 15(0.6), 5(0.7), 5(0.3), 5(0.4); annotation: most-probable-path-distance]
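The distance distribution in the table above can be reproduced by enumerating every possible world of a small graph. The sketch below is only an illustration: the placement of weights and probabilities on the edges is an assumption, chosen so that the B-to-D distribution matches the table, and the names `EDGES`, `shortest_distance` and `distance_distribution` are hypothetical.

```python
import heapq
import itertools
import math
from collections import defaultdict

# Hypothetical edge placement (u, v, weight, probability), chosen so that the
# B-to-D distribution matches the table; the original figure may differ.
EDGES = [
    ("B", "D", 5, 0.3),
    ("B", "A", 15, 0.2),
    ("A", "D", 5, 0.6),
    ("B", "C", 15, 0.7),
    ("C", "D", 5, 0.4),
]

def shortest_distance(present_edges, source, target):
    """Dijkstra over the edges present in one possible world."""
    adj = defaultdict(list)
    for u, v, w, _ in present_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # target unreachable in this world

def distance_distribution(edges, source, target):
    """Exact distribution of the shortest distance over all 2^|E| worlds."""
    dist = defaultdict(float)
    for mask in itertools.product([False, True], repeat=len(edges)):
        prob = 1.0
        present = []
        for keep, edge in zip(mask, edges):
            prob *= edge[3] if keep else 1.0 - edge[3]
            if keep:
                present.append(edge)
        dist[shortest_distance(present, source, target)] += prob
    return dict(dist)

distribution = distance_distribution(EDGES, "B", "D")
```

Enumeration is only feasible for toy graphs like this one, since the number of possible worlds doubles with every edge.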
Problem Definition
• Assumption: edges are independent
• Probabilistic graph model G(V, E, P, W)
• V and E denote the sets of nodes and edges, respectively;
• P denotes the probability associated with each edge;
• W assigns a weight to each edge
• k-NN Query
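Under the edge-independence assumption, the probability of a single possible world follows directly from P. Writing G' = (V, E') with E' ⊆ E for a possible world (notation introduced here for illustration):

```latex
\Pr[G'] \;=\; \prod_{e \in E'} P(e) \;\prod_{e \in E \setminus E'} \bigl(1 - P(e)\bigr)
```

Each edge is either present or absent, so G has 2^{|E|} possible worlds in total.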
Distances
• Median-Distance(s, t)
• Majority-Distance(s, t)
• Expected-Reliable-Distance(s, t)
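As an illustration of how the three distances can differ, the sketch below computes each of them from a given distance distribution. The definitions used are an interpretation of the names and are stated as assumptions: the median is the smallest distance whose cumulative probability reaches 1/2, the majority distance is the most probable distance (∞ included), and the expected reliable distance is the expected distance conditioned on the target being reachable. The function names are hypothetical.

```python
import math

def median_distance(dist):
    """Smallest d whose cumulative probability reaches 1/2 (assumed convention)."""
    cumulative = 0.0
    for d in sorted(dist):
        cumulative += dist[d]
        if cumulative >= 0.5:
            return d
    return math.inf

def majority_distance(dist):
    """Most probable distance value; infinity is allowed (assumed convention)."""
    return max(dist, key=dist.get)

def expected_reliable_distance(dist):
    """Expected distance over the worlds in which the target is reachable."""
    reachable = sum(p for d, p in dist.items() if d != math.inf)
    if reachable == 0:
        return math.inf
    return sum(d * p for d, p in dist.items() if d != math.inf) / reachable

# Distribution of the B-to-D distance from the Motivation slide.
example = {5: 0.3, 20: 0.25648, math.inf: 0.44352}

median = median_distance(example)
majority = majority_distance(example)
reliable = expected_reliable_distance(example)
```

On this distribution the three functions disagree: the median is 20, the most probable single value is ∞, and the expected reliable distance lies between 5 and 20.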
Challenges
• Computing the median-distance and the majority-distance requires the distance distribution over all possible worlds.
• Computing the expected-reliable-distance has been proved to be #P-hard.
Sampling
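A minimal sketch of the sampling idea, assuming plain Monte Carlo: draw r possible worlds by keeping each edge independently with its probability, run Dijkstra in each world, and use the observed frequencies as an estimate of the distance distribution. The graph `EDGES` and all function names are hypothetical, and the edge placement is an assumption.

```python
import heapq
import math
import random
from collections import Counter, defaultdict

# Hypothetical uncertain graph: (u, v, weight, probability), undirected.
EDGES = [
    ("B", "D", 5, 0.3),
    ("B", "A", 15, 0.2),
    ("A", "D", 5, 0.6),
    ("B", "C", 15, 0.7),
    ("C", "D", 5, 0.4),
]

def sample_world(edges, rng):
    """Keep each edge independently with its probability."""
    adj = defaultdict(list)
    for u, v, w, p in edges:
        if rng.random() < p:
            adj[u].append((v, w))
            adj[v].append((u, w))
    return adj

def dijkstra(adj, source):
    """Shortest distances from source to every reachable node."""
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return best

def sampled_distribution(edges, source, target, r, seed=0):
    """Estimate the distance distribution from r sampled worlds."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(r):
        adj = sample_world(edges, rng)
        counts[dijkstra(adj, source).get(target, math.inf)] += 1
    return {d: c / r for d, c in counts.items()}

estimate = sampled_distribution(EDGES, "B", "D", r=20000)
```

The estimate only approximates the true distribution; how large r must be for a given accuracy is the subject of the sample-size slides.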
Sample Size for Median-D
Sample Size for E-R-D
Qualitative Analysis
• Classification experiment
• Testing data: two classes; one is a set of triplets of the form <A, B0, B1> and the other is a set of triplets of the form <A, B1, B0>
• Classifier: tries to identify the true neighbors
• Measure: <false positive rate, true positive rate>
• Data sets: protein-protein interaction network; DBLP co-authorship network
Results
Observation: Median-D
• Consider a new probability distribution
• The following lemma can then be obtained
where D is a distance value
Core Pruning Scheme
• Query Transformation
$d_{D,M}(s, t_1) < d_{D,M}(s, t_2) \Rightarrow d_M(s, t_1) < d_M(s, t_2)$
$d_M(s, t_1) \ge d_M(s, t_2) \Rightarrow d_{D,M}(s, t_1) \ge d_{D,M}(s, t_2)$
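The implications above can be checked on a small example. The sketch assumes that the D-truncated distribution moves all probability mass at distances ≥ D onto the single value D, and that the median is the smallest distance whose cumulative probability reaches 1/2. Both conventions are assumptions, and the distribution `t2` is made up for illustration.

```python
import math

def median(dist):
    """Smallest d whose cumulative probability reaches 1/2 (assumed convention)."""
    cumulative = 0.0
    for d in sorted(dist):
        cumulative += dist[d]
        if cumulative >= 0.5:
            return d
    return math.inf

def truncate(dist, D):
    """Lump all probability mass at distances >= D onto the value D."""
    truncated = {d: p for d, p in dist.items() if d < D}
    tail = sum(p for d, p in dist.items() if d >= D)
    if tail > 0:
        truncated[D] = tail
    return truncated

t1 = {5: 0.3, 20: 0.25648, math.inf: 0.44352}  # exact median 20
t2 = {10: 0.2, 30: 0.5, math.inf: 0.3}         # exact median 30 (hypothetical)

D = 25
m1 = median(truncate(t1, D))  # 20: the median already lies below D
m2 = median(truncate(t2, D))  # 25: only known to be at least D
```

Under these assumptions the truncated median equals min(exact median, D), so a strict inequality between truncated medians carries over to the exact medians. With D = 15 both truncated medians equal 15 and nothing can be concluded yet, which is why D has to grow during query processing.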
Median-D kNN Query Answering
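One way to turn the pruning scheme into a k-NN procedure is sketched below. It is a simplified illustration and not necessarily the algorithm presented in the talk: it samples r worlds, explores each world only up to a distance bound D, and raises D until k nodes have a truncated median strictly below D, at which point their exact sample medians are known. It reruns the bounded Dijkstra from scratch for every D, which a real implementation would avoid. The graph, the step size and all names are hypothetical.

```python
import heapq
import math
import random
from collections import defaultdict

# Hypothetical uncertain graph: (u, v, weight, probability), undirected.
EDGES = [
    ("B", "D", 5, 0.3),
    ("B", "A", 15, 0.2),
    ("A", "D", 5, 0.6),
    ("B", "C", 15, 0.7),
    ("C", "D", 5, 0.4),
]

def sample_world(edges, rng):
    """Keep each edge independently with its probability."""
    adj = defaultdict(list)
    for u, v, w, p in edges:
        if rng.random() < p:
            adj[u].append((v, w))
            adj[v].append((u, w))
    return adj

def distances_below(adj, source, bound):
    """Dijkstra restricted to nodes whose distance is strictly below bound."""
    best = {source: 0}
    heap = [(0, source)]
    settled = {}
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled[u] = d
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < bound and nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return settled

def median_knn(edges, source, k, r=5000, step=5, seed=0):
    """k nodes with the smallest sample median distance from source."""
    rng = random.Random(seed)
    worlds = [sample_world(edges, rng) for _ in range(r)]
    longest = sum(w for _, _, w, _ in edges)  # no finite distance exceeds this
    half = math.ceil(r / 2)                   # rank of the sample median
    bound = step
    while True:
        seen = defaultdict(list)
        for adj in worlds:
            for node, d in distances_below(adj, source, bound).items():
                if node != source:
                    seen[node].append(d)
        # A node reached below `bound` in at least half of the worlds has a
        # truncated median below `bound`, which is then its exact sample median.
        exact = sorted((sorted(ds)[half - 1], node)
                       for node, ds in seen.items() if len(ds) >= half)
        if len(exact) >= k or bound > longest:
            return exact[:k]
        bound += step

result = median_knn(EDGES, "B", k=2)
```

Nodes that never reach the half-of-the-worlds threshold have a median of at least the current bound, so they cannot displace the nodes already returned.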
Majority-D kNN Query Answering
• A distance d is guaranteed to be the exact majority distance once Pr(d) >= 1 – P, where P denotes the sum of the visited nodes' probabilities.
• A node that enters the kNN-set may be replaced at a later step by another node with a smaller majority distance.
Experimental Results
• Dataset overview
• Convergence of D-F
Using the distances computed on a sample of 500 possible worlds (pws) as the ground truth
Efficiency of k-NN Pruning
The fraction of visited nodes (pruning efficiency) as a function of k
Pruning efficiency as a function of sample size
Quality of Results
Pruning efficiency as a function of edge probability
Median-D
Stability as a function of the number of possible worlds