k-Nearest Neighbors in Uncertain Graphs

Lin Yincheng, 2011-02-28

VLDB10

Database Group@CSE

Outline

• Background
• Motivation
• Problem Definition
• Query Answering Approach
• Experimental Results

Background

• k-Nearest Neighbors
• Uncertain Graphs

[Figure: example weighted graph; edge weights 15, 15, 5, 5, 5]

Example: find the 2-nearest neighbors of vertex B

Motivation

Distance | Path | Probability
5 | B-D | 0.3
20 | B-A-D, B-C-D | 0.25648
∞ | No path | 0.44352

• Define meaningful distance functions that are more useful for identifying true neighbors

• Introduce a novel pruning algorithm to process k-NN queries in uncertain graphs

[Figure: example uncertain graph; edges labeled weight(probability): 15(0.2), 15(0.6), 5(0.7), 5(0.3), 5(0.4)]

most-probable-path-distance

Problem Definition

• Assumption: independence among edges

• Probabilistic graph model G(V, E, P, W)

• V and E denote the sets of nodes and edges, respectively;

• P denotes the existence probability associated with each edge;

• W assigns a weight to each edge

• k-NN Query
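
Under the edge-independence assumption, the probability of one possible world is the product of p over the edges it keeps and (1 - p) over the edges it drops. The sketch below illustrates this; the function names and the `(u, v, weight, prob)` tuple layout are assumptions made for illustration and do not come from the slides.

```python
import itertools


def world_probability(edges, present):
    """Probability of the possible world whose edge set is `present`.

    edges   : list of (u, v, weight, prob) tuples (hypothetical layout)
    present : set of (u, v) pairs that exist in this world
    """
    prob = 1.0
    for u, v, w, p in edges:
        # Independent edges: multiply p if kept, (1 - p) if dropped.
        prob *= p if (u, v) in present else 1.0 - p
    return prob


def all_worlds(edges):
    """Yield every subset of the edge set as a set of (u, v) pairs."""
    keys = [(u, v) for u, v, w, p in edges]
    for size in range(len(keys) + 1):
        for subset in itertools.combinations(keys, size):
            yield set(subset)
```

A graph with m edges has 2^m possible worlds, so exhaustive enumeration like `all_worlds` is only feasible for toy inputs.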

Distances

• Median-Distance(s, t)

• Majority-Distance(s, t)

• Expected-Reliable-Distance(s, t)
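
The slides list the three distances by name only. As a hedged sketch, they can be computed from a distance distribution (a map from distance value to probability) as follows. The conventions used here are assumptions: the median is taken as the smallest distance whose cumulative probability reaches 1/2, the majority distance as the most probable value, and the expected-reliable-distance as the expected distance conditioned on s and t being connected. All function names are hypothetical.

```python
import math


def median_distance(dist):
    """Smallest d whose cumulative probability reaches 1/2 (assumed convention)."""
    total = 0.0
    for d in sorted(dist):
        total += dist[d]
        if total >= 0.5:
            return d
    return math.inf


def majority_distance(dist):
    """Most probable distance value; ties go to the smaller distance."""
    return min(dist, key=lambda d: (-dist[d], d))


def expected_reliable_distance(dist):
    """Expected distance over the worlds in which s and t are connected."""
    connected = sum(p for d, p in dist.items() if d != math.inf)
    if connected == 0:
        return math.inf
    return sum(d * p for d, p in dist.items() if d != math.inf) / connected


# Distribution of the B-to-D distance from the Motivation slide.
example = {5: 0.3, 20: 0.25648, math.inf: 0.44352}
```

On `example`, these conventions give a median of 20, a majority distance of ∞ (the "no path" outcome is the single most probable one), and an expected-reliable-distance of about 11.91.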

Challenges

• Computing the median-distance and the majority-distance requires the distance distribution over all possible worlds.

• Computing the expected-reliable-distance has been proved to be #P-hard.

Sampling
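
A minimal Monte Carlo sketch of the sampling idea: draw r possible worlds, run a shortest-path search in each, and use the observed frequencies as an estimate of the distance distribution. The function names, the `(u, v, weight, prob)` edge layout, the undirected-graph assumption, and the parameters `r` and `seed` are illustrative choices, not taken from the slides.

```python
import heapq
import math
import random
from collections import Counter, defaultdict


def shortest_distance(adj, s, t):
    """Plain Dijkstra; returns math.inf when t is unreachable from s."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


def estimate_distribution(edges, s, t, r=1000, seed=0):
    """Estimate the s-t distance distribution from r sampled worlds."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(r):
        adj = defaultdict(list)
        for u, v, w, p in edges:
            if rng.random() < p:  # keep each edge independently
                adj[u].append((v, w))
                adj[v].append((u, w))
        counts[shortest_distance(adj, s, t)] += 1
    return {d: c / r for d, c in counts.items()}
```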

Sample Size for Median-D

Sample Size for E-R-D

Qualitative Analysis

• Classification experiment
• Testing data: two classes, one a set of triplets of the form <A, B0, B1> and the other a set of triplets of the form <A, B1, B0>
• A classifier: it tries to identify the true neighbors
• Measure: <false positive rate, true positive rate>
• Data sets: protein-protein interaction network, DBLP co-authorship network

Results

Observation: Median-D

• Consider a new probability distribution

• The following lemma can then be obtained

D is a distance value

Core Pruning Scheme

• Query Transformation

$d_{D,M}(s, t_1) < d_{D,M}(s, t_2) \Rightarrow d_M(s, t_1) < d_M(s, t_2)$

$d_M(s, t_1) \ge d_M(s, t_2) \Rightarrow d_{D,M}(s, t_1) \ge d_{D,M}(s, t_2)$
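
The implications above can be checked on a small example. The sketch assumes that the "new probability distribution" is the original distance distribution truncated at D, with all probability mass at distances of D or more moved onto the single value D. The slides do not show the definition, so this reading is an assumption. Under it, the truncated median equals min(true median, D), which is why a strict ordering of truncated medians carries over to the true medians.

```python
import math


def truncate(dist, D):
    """Move all probability mass at distances >= D onto the value D (assumed definition)."""
    out = {d: p for d, p in dist.items() if d < D}
    tail = sum(p for d, p in dist.items() if d >= D)
    if tail > 0:
        out[D] = tail
    return out


def median(dist):
    """Smallest d whose cumulative probability reaches 1/2 (assumed convention)."""
    total = 0.0
    for d in sorted(dist):
        total += dist[d]
        if total >= 0.5:
            return d
    return math.inf


# Hypothetical distance distributions from s to two targets t1 and t2.
t1 = {2: 0.75, math.inf: 0.25}
t2 = {9: 0.75, math.inf: 0.25}
D = 5
```

Here the truncated medians are 2 and 5, and the true medians are 2 and 9, so the ordering found with only the information up to distance D is already the correct one.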

Median-D kNN Query Answering
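
One way a median-distance k-NN search with pruning could be organized is sketched below. It is a simplified illustration and not necessarily the algorithm presented on the original slide: Dijkstra runs in r sampled worlds in lockstep, the search radius D grows by a step `gamma`, and a node is reported as soon as it has been reached within D in at least half of the worlds, since its sample median is then known exactly. The names `median_knn`, `sample_world`, `r`, `gamma`, and `seed` are hypothetical, and edge weights are assumed to be positive.

```python
import heapq
import math
import random
from collections import defaultdict


def sample_world(edges, rng):
    """Keep each edge (u, v, weight, prob) independently with probability prob."""
    adj = defaultdict(list)
    for u, v, w, p in edges:
        if rng.random() < p:
            adj[u].append((v, w))
            adj[v].append((u, w))
    return adj


def median_knn(edges, s, k, r=200, gamma=1.0, seed=0):
    rng = random.Random(seed)
    worlds = [sample_world(edges, rng) for _ in range(r)]
    heaps = [[(0.0, s)] for _ in range(r)]   # one Dijkstra frontier per world
    settled = [set() for _ in range(r)]
    found = defaultdict(list)                # node -> distances settled so far
    need = math.ceil(r / 2)                  # samples needed to fix the median
    result = {}
    D = 0.0
    while len(result) < k and any(heaps):
        D += gamma
        # Advance every world's Dijkstra up to radius D only.
        for adj, heap, done in zip(worlds, heaps, settled):
            while heap and heap[0][0] <= D:
                d, u = heapq.heappop(heap)
                if u in done:
                    continue
                done.add(u)
                if u != s:
                    found[u].append(d)
                for v, w in adj[u]:
                    if v not in done:
                        heapq.heappush(heap, (d + w, v))
        # A node seen in at least half of the worlds has its sample median fixed.
        for t, dists in found.items():
            if t not in result and len(dists) >= need:
                result[t] = sorted(dists)[need - 1]
    return sorted(result.items(), key=lambda kv: (kv[1], str(kv[0])))[:k]
```

Nodes whose median exceeds the final radius are never fully explored, which is where the pruning comes from; a smaller `gamma` prunes more aggressively at the cost of more rounds.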

Majority-D kNN Query Answering

• A distance d is guaranteed to be the exact majority distance once Pr(d) >= 1 − P, where P denotes the sum of the visited nodes' probabilities.

• A node that has entered the kNN-set may still be replaced at a later step by another node with a smaller majority distance.
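
The stopping condition for the majority distance can be sketched as a small check. This assumes one particular reading of the slide: P is the total probability mass observed so far for the target node, so 1 − P is the mass that could still fall on a single unseen distance. The function name and the dictionary layout are hypothetical.

```python
def majority_if_certain(partial):
    """Return the majority distance if it can no longer change, else None.

    partial maps each distance seen so far to its (estimated) probability.
    """
    if not partial:
        return None
    seen_mass = sum(partial.values())  # P under the assumed reading
    # Current best candidate: most probable distance, ties to the smaller one.
    best_d = min(partial, key=lambda d: (-partial[d], d))
    # The unseen mass 1 - P cannot overtake the candidate once Pr(d) >= 1 - P.
    if partial[best_d] >= 1.0 - seen_mass:
        return best_d
    return None
```

For instance, after seeing only distance 5 with probability 0.3, the remaining 0.7 could still concentrate on another distance, so no answer is certain yet.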

Experimental Results

• Dataset overview
• Convergence of D-F

Using the distances from a sample of 500 possible worlds (pws) as the ground truth

Efficiency of k-NN Pruning

The fraction of visited nodes (pruning efficiency) as a function of k

Pruning efficiency as a function of sample size

Quality of Results

Pruning efficiency as a function of edge probability

Median-D

Stability as a function of the number of possible worlds
