Probabilistic Similarity Query on Dimension Incomplete Data

Wei Cheng 1 , Xiaoming Jin 1 , and Jian-Tao Sun 2 Intelligent Data Engineering Group, School of Software, Tsinghua University 1 Microsoft Research Asia 2 ICDM 2009, Miami



Page 1: Probabilistic Similarity Query on Dimension Incomplete Data

Wei Cheng1, Xiaoming Jin1, and Jian-Tao Sun2

Intelligent Data Engineering Group, School of Software, Tsinghua University1

Microsoft Research Asia2

ICDM 2009, Miami

Page 2: Probabilistic Similarity Query on Dimension Incomplete Data

Outline

Motivation & Problem

Our Solution

Experiments

Related Work

Summary and Future Work

Page 3: Probabilistic Similarity Query on Dimension Incomplete Data

Motivation

Multidimensional data are everywhere

Time series: stock data, data collected from sensor monitors

Feature vectors extracted from images or texts

...

Similarity query on multidimensional data is important in data mining, databases, and information retrieval

Page 4: Probabilistic Similarity Query on Dimension Incomplete Data

Similarity query is challenging when the data is incomplete

Data incompleteness happens when:

Sensors do not work properly

Certain features are missing from particular feature vectors

...

[Figure: a query compared against sensor data, a text vector, and an image vector, each with missing elements]

To process a similarity query, imputation is necessary, i.e., "completing" the missing data by filling in specific values.

Page 5: Probabilistic Similarity Query on Dimension Incomplete Data

Dimension incomplete data

Dimension incomplete data satisfies:

(a) At least one of its data elements is missing;

(b) The dimension of the missing data element cannot be determined.

E.g.

Observed data: (3, 6)

But we know the complete data should have three dimensions.

The missing element might be in the first, second, or third dimension.

Page 6: Probabilistic Similarity Query on Dimension Incomplete Data

Causes of dimension incompleteness

Dimension incompleteness happens when:

Data goes missing while the order is used as the implicit dimension indicator

The dimension indicator itself is also lost

...

Page 7: Probabilistic Similarity Query on Dimension Incomplete Data

Similarity query is more challenging when the dimension is incomplete

To measure the similarity between query and the dimension incomplete data object, we should first recover the incomplete data.

Enumerating all combination cases? Too time-consuming.

E.g. $X_{obs}$ = (3, 6), one dimension lost

3 possible results after data recovery, where X is the imputed element:

(X, 3, 6)

(3, X, 6)

(3, 6, X)

For an m-dimensional data object which has n elements missing, there will be $C_m^n$ cases to recover it.
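The recovery cases counted on this slide can be listed mechanically. The following is a minimal sketch, not the authors' code; the function name `enumerate_recoveries` and the `placeholder` argument are illustrative assumptions. It keeps the observed elements in their original order and places the imputed elements at every possible combination of positions.

```python
from itertools import combinations
from math import comb

def enumerate_recoveries(x_obs, m, placeholder=None):
    """Yield every m-dimensional recovery of x_obs, keeping observed order."""
    n_missing = m - len(x_obs)
    for missing_positions in combinations(range(m), n_missing):
        missing = set(missing_positions)
        observed = iter(x_obs)
        # Fill missing positions with the placeholder, the rest in order.
        yield tuple(placeholder if i in missing else next(observed)
                    for i in range(m))

recoveries = list(enumerate_recoveries((3, 6), m=3, placeholder="X"))
print(recoveries)       # [('X', 3, 6), (3, 'X', 6), (3, 6, 'X')]
print(len(recoveries) == comb(3, 1))  # True
```

The count grows combinatorially with m and n, which is why the slides look for bounds instead of enumerating every case.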

Page 8: Probabilistic Similarity Query on Dimension Incomplete Data

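Problem statement:

$$\mathrm{PSQ\text{-}DID}(D, Q, r, c) = \{\, X_{obs} \in D \mid P[\delta(Q, X_{obs}) \le r] \ge c \,\}$$

Here D is the data set, Q the query, r the distance threshold, and c the confidence threshold.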

Page 9: Probabilistic Similarity Query on Dimension Incomplete Data

Two assumptions:

The probability of using each recovery result is equal.

The missing values obey a normal distribution.

$$P[\delta(Q, X_{obs}) \le r] = \frac{\sum_{X_{rv}} P[\delta(Q, X_{rv}) \le r]}{C_{|Q|}^{|X_{mis}|}}$$
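To make the two assumptions concrete, here is a rough sketch of the naive way to compute the confidence: average, with equal weight, the probability obtained from every recovery version. It is an illustration under stated assumptions rather than the method from the talk. The probability for each recovery is estimated by Monte Carlo sampling, and every imputed element is drawn from one normal distribution with hypothetical parameters `mu` and `sigma`.

```python
import random
from itertools import combinations
from math import sqrt

def naive_confidence(q, x_obs, r, mu, sigma, samples=2000, seed=0):
    """Estimate P[delta(Q, X_obs) <= r] by averaging over all recovery versions."""
    rng = random.Random(seed)
    m = len(q)
    n_missing = m - len(x_obs)
    per_recovery = []
    for missing_positions in combinations(range(m), n_missing):
        missing = set(missing_positions)
        hits = 0
        for _ in range(samples):
            observed = iter(x_obs)
            # Draw each imputed element from the assumed normal distribution.
            x_rv = [rng.gauss(mu, sigma) if i in missing else next(observed)
                    for i in range(m)]
            dist = sqrt(sum((a - b) ** 2 for a, b in zip(q, x_rv)))
            hits += dist <= r
        per_recovery.append(hits / samples)
    # Equal weight for every recovery version (first assumption).
    return sum(per_recovery) / len(per_recovery)

print(naive_confidence((1, 4, 5, 6, 7), (2, 8, 7), r=4.0, mu=5.0, sigma=1.0))
```

The outer loop runs once per recovery version, so the cost scales with the number of recovery cases. That is the expense the pruning methods on the following slides try to avoid.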

Page 10: Probabilistic Similarity Query on Dimension Incomplete Data

Efficient approach for PSQ-DID

A gradual refinement search strategy including two pruning methods:

Lower/upper bounds of confidence

Probability triangle inequality

[Figure: Our overall query process]
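A gradual refinement strategy of this kind can be sketched as a filter-then-verify loop. The code below is an assumed structure, not the authors' implementation. `bound_fns` and `verify` are hypothetical callables, and the order in which pruners run is a choice made here for illustration.

```python
def gradual_refinement(objects, c, bound_fns, verify):
    """Filter objects with cheap confidence bounds, then verify the rest.

    bound_fns: list of hypothetical functions obj -> (lower, upper) bounds on
               the confidence P[delta(Q, X) <= r], cheapest first.
    verify:    hypothetical exact (expensive) confidence function obj -> float.
    """
    results = []
    for obj in objects:
        decided = False
        for bounds in bound_fns:
            lower, upper = bounds(obj)
            if lower >= c:        # confidence is certainly high enough
                results.append(obj)
                decided = True
                break
            if upper < c:         # confidence is certainly too low
                decided = True
                break
        # Only objects no pruner could judge reach the expensive step.
        if not decided and verify(obj) >= c:
            results.append(obj)
    return results
```

The more objects the bounds decide, the fewer reach `verify`, which is the effect the pruning power measure in the experiments captures.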

Page 11: Probabilistic Similarity Query on Dimension Incomplete Data

Lower and upper bounds of confidence

The missing part and the observed part of the dimension incomplete data are treated separately. Since we use Euclidean distance, we have:

$$\delta^2(Q, X_{rv}) = \delta^2(Q_{obs}, X_{obs}) + \delta^2(Q_{mis}, X_{mis})$$

Lower/upper bounds of the observed part, denoted by $\delta_{LB_{obs}}$ and $\delta_{UB_{obs}}$.

Lower/upper bounds of the missing part, denoted by $\delta_{LB_{mis}}$ and $\delta_{UB_{mis}}$.

$$\delta_{LB_{obs}}(Q, X_{obs}) = \min_{|Q_{obs}| = |X_{obs}|} \delta(Q_{obs}, X_{obs})$$

$$\delta_{LB_{mis}}(Q, X_{mis}) = \delta\Big(\arg\min_{Q_{mis}} \{\, \delta(Q_{mis}, E(X_{mis})) \;\big|\; |Q_{mis}| = |X_{mis}| \,\},\ X_{mis}\Big)$$

Page 12: Probabilistic Similarity Query on Dimension Incomplete Data

E.g. $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$

$\delta^2_{LB_{obs}}(Q, X_{obs}) = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$

corresponding recovery version: $(2, x_1, x_2, 8, 7)$

For the imputed random variables $X_{mis} = \{x_1, x_2\}$, if the imputation policy uses the mean value of the two adjacent observed elements as the expectation of the imputed random variables, then

$\delta^2_{LB_{mis}}(Q, X_{mis}) = (4 - x_1)^2 + (5 - x_2)^2$, with $E(x_1) = E(x_2) = 5$,

corresponding to $X_{rv} = (2, 5, 5, 8, 7)$, where the two 5s are the expected values of the imputed elements.
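The observed-part lower bound in this example is a minimum over all order-preserving ways of matching the observed elements to query elements. One way to find it without enumerating every alignment is a small dynamic program. This is a sketch of that idea, not necessarily the algorithm used in the talk, and the name `lb_obs_squared` is hypothetical.

```python
def lb_obs_squared(q, x_obs):
    """Minimum squared distance between x_obs and any order-preserving
    subsequence of q with the same length (dynamic programming)."""
    inf = float("inf")
    k = len(x_obs)
    # best[j] = minimum cost of matching the first j observed elements
    # using the query elements seen so far.
    best = [0.0] + [inf] * k
    for q_val in q:
        # Go right to left so each query element is matched at most once.
        for j in range(k, 0, -1):
            cand = best[j - 1] + (x_obs[j - 1] - q_val) ** 2
            if cand < best[j]:
                best[j] = cand
    return best[k]

print(lb_obs_squared((1, 4, 5, 6, 7), (2, 8, 7)))  # 5.0
```

The result matches the value 5 on the slide, obtained by aligning 2, 8, 7 with the query elements 1, 6, 7.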

Page 13: Probabilistic Similarity Query on Dimension Incomplete Data

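Lower and upper bounds of confidence

$$P[\delta_{UB}(Q, X) \le r] = P[\delta^2_{UB_{mis}}(Q, X_{mis}) + \delta^2_{UB_{obs}}(Q, X_{obs}) \le r^2]$$

$$P[\delta_{LB}(Q, X) \le r] = P[\delta^2_{LB_{mis}}(Q, X_{mis}) + \delta^2_{LB_{obs}}(Q, X_{obs}) \le r^2]$$

The two sums are denoted by $\delta^2_{UB}(Q, X)$ and $\delta^2_{LB}(Q, X)$.

We prove that

$$P[\delta_{UB}(Q, X) \le r] \le P[\delta(Q, X_{obs}) \le r] \le P[\delta_{LB}(Q, X) \le r]$$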

Page 14: Probabilistic Similarity Query on Dimension Incomplete Data

Probability triangle inequality

Given a query Q and a multidimensional data object R (|Q| = |R|), and a dimension incomplete data object $X_{obs}$ whose underlying complete version is X, we have:

(1) $P[\delta(Q, X) \le r] \le P[\delta_{LB}(R, X) - \delta(Q, R) \le r]$

(2) $P[\delta(Q, X) \le r] \ge P[\delta_{UB}(R, X) + \delta(Q, R) \le r]$

The terms involving R and X are calculated in advance and stored in the database: $O(|X_{obs}|(|Q| - |X_{obs}|)^2)$

$\delta(Q, R)$ is calculated during query processing: $O(|Q|)$
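The idea behind this pruner can be illustrated with a deterministic simplification: if bounds on the distance between a reference object R and X are stored ahead of time, the ordinary triangle inequality turns them into bounds on the distance between Q and X using only one distance computation at query time. The sketch below is that simplified version, not the probabilistic form on the slide. `lb_rx` and `ub_rx` are assumed to be precomputed values.

```python
from math import dist

def bounds_via_reference(q, r_obj, lb_rx, ub_rx):
    """Bound delta(Q, X) using a reference object R.

    lb_rx, ub_rx: precomputed bounds with lb_rx <= delta(R, X) <= ub_rx.
    """
    d_qr = dist(q, r_obj)  # the only term computed at query time, O(|Q|)
    # delta(Q, X) >= |delta(R, X) - delta(Q, R)|
    lower = max(0.0, lb_rx - d_qr, d_qr - ub_rx)
    # delta(Q, X) <= delta(R, X) + delta(Q, R)
    upper = ub_rx + d_qr
    return lower, upper

print(bounds_via_reference((0, 0), (3, 4), lb_rx=7.0, ub_rx=8.0))  # (2.0, 13.0)
```

With a distance threshold r, an object whose upper bound is at most r can be reported without further work, and one whose lower bound exceeds r can be dismissed.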

Page 15: Probabilistic Similarity Query on Dimension Incomplete Data

Experiments

Data sets:

Standard and Poor 500 index historical stock data (S&P500) (251 dimensions)

A new data set with 30 dimensions, obtained by segmenting the S&P500 data set, resulting in 4328 data objects.

Corel Color Histogram data (IMAGE): 68040 images, 32 dimensions

Dimension incomplete data set: generated by randomly removing some dimensions of each data object.
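Generating a dimension incomplete data set by random removal could look like the sketch below. The function name `drop_dimensions` is hypothetical, and treating the missing ratio as the fraction of dimensions removed from every object is an assumption made for illustration. The surviving elements keep their order, but their original positions are discarded, which is what makes the result dimension incomplete.

```python
import random

def drop_dimensions(x, missing_ratio, rng=random):
    """Remove a random subset of elements and forget where they were."""
    n_missing = round(len(x) * missing_ratio)
    # Choose which positions survive, then keep them in original order.
    keep = sorted(rng.sample(range(len(x)), len(x) - n_missing))
    return [x[i] for i in keep]

rng = random.Random(42)
complete = list(range(30))            # stand-in for a 30-dimensional object
incomplete = drop_dimensions(complete, missing_ratio=0.1, rng=rng)
print(len(incomplete))                # 27
```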

Page 16: Probabilistic Similarity Query on Dimension Incomplete Data

Experiment Setup

Ground truth: similarity query results on the complete data

Performance measures: precision, recall, pruning power

Pruning power = $N_{definite} / N_{processed}$

$N_{processed}$: number of all data objects

$N_{definite}$: number of data objects judged as dismissals or search results by the pruner.

Query: 100 data objects randomly sampled from the data set
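The three measures can be computed as follows. This is a small sketch with hypothetical function names. Returning 1.0 for precision or recall when the corresponding set is empty is a convention chosen here, not something stated in the talk.

```python
def precision_recall(returned, ground_truth):
    """Precision and recall of a query result against the ground truth."""
    returned, ground_truth = set(returned), set(ground_truth)
    hits = len(returned & ground_truth)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(ground_truth) if ground_truth else 1.0
    return precision, recall

def pruning_power(n_definite, n_processed):
    """Fraction of processed objects that the pruner could judge."""
    return n_definite / n_processed

print(precision_recall([1, 2, 3, 4], [2, 3, 5]))  # precision 0.5, recall 2/3
print(pruning_power(30, 40))                      # 0.75
```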

Page 17: Probabilistic Similarity Query on Dimension Incomplete Data

Effectiveness of probabilistic similarity query on dimension incomplete data

Query precision on S&P500 data set

Query recall on S&P500 data set

Page 18: Probabilistic Similarity Query on Dimension Incomplete Data

Effectiveness of probabilistic similarity query on dimension incomplete data

Query precision on IMAGE data set

Query recall on IMAGE data set

Page 19: Probabilistic Similarity Query on Dimension Incomplete Data

Effect of the confidence threshold

Missing ratio = 0.1; r = 60 for S&P500, r = 0.7 for IMAGE data

Confidence threshold vs precision-recall

Page 20: Probabilistic Similarity Query on Dimension Incomplete Data

Effectiveness of different pruners

Pruning power of probability triangle inequality

Page 21: Probabilistic Similarity Query on Dimension Incomplete Data

Pruning Power of Four Pruners

Pruner1: probability triangle inequality using the confidence lower bound; Pruner2: probability triangle inequality using the confidence upper bound; Pruner3: confidence lower bound; Pruner4: confidence upper bound

missing ratio = 10%, c = 0.1, number of assistant objects = 20

Pruning power of four pruners

Page 22: Probabilistic Similarity Query on Dimension Incomplete Data

Comparison of query quality when neglecting naïve verification

For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals.

c=0.1

Comparison of query quality

Page 23: Probabilistic Similarity Query on Dimension Incomplete Data

Performance analysis

Time cost

Page 24: Probabilistic Similarity Query on Dimension Incomplete Data

Related Work

Few research papers discuss similarity search on dimension incomplete data

Incomplete data

Recovery: D. Williams et al. [ICML'05], K. Lakshminarayan et al. [Applied Intelligence'99], ...

Indexing: G. Canahuate et al. [EDBT'06], B. C. Ooi et al. [VLDB'98], ...

Uncertain data: J. Pei et al. [SIGMOD'08], D. Burdick et al. [VLDB'05], ...

Dimension incomplete data

Symbolic sequences: J. Gu et al. [DEXA'07]

Page 25: Probabilistic Similarity Query on Dimension Incomplete Data

Summary and Future Work

Problem:

Tackle the similarity query on a new form of uncertainty (dimension incomplete data)

Solution:

Lower and upper bounds of confidence: avoid enumerating all $C_{|Q|}^{|X_{mis}|}$ recovery cases

Probability triangle inequality: further boosts performance during query processing

Future work:

Other similarity measurements

Indexing dimension incomplete data

Page 26: Probabilistic Similarity Query on Dimension Incomplete Data

Many thanks!