probabilistic queries and uncertain data

Page 1: Probabilistic Queries and Uncertain Data

Probabilistic Queries and Uncertain Data

Sunil Prabhakar

Department of Computer Sciences

Purdue University

Email: sunil@cs.purdue.edu

http://www.cs.purdue.edu/homes/sunil

Page 2: Probabilistic Queries and Uncertain Data

Sunil Prabhakar, Probabilistic Queries and Uncertain Data, COMAD 2005b 2

Introduction

The traditional database model expects data items to be modeled as sets (bags) of tuples consisting of precise attribute values.

However, real-world data does not easily fit into this model if there is uncertainty in the information.

Uncertainty comes from many sources: unreliable measurements and data sources, incomplete or missing information, irreconcilable facts, …

This problem has been recognized for a long time (e.g. NULL values) and numerous models have been proposed.

Page 3: Probabilistic Queries and Uncertain Data

Introduction

Long history of ideas for incorporating uncertain data in databases

Many proposals for models
Recent renewed interest in the area
Some initial work on developing systems
This tutorial provides a sampling of the area. More information at

http://www.cs.purdue.edu/homes/sunil

Page 4: Probabilistic Queries and Uncertain Data

Outline

Motivating examples
Proposed Models
Implementation issues
  Efficiency
  Scalability
  Prototypes
Open problems
References

Motivating examples

Page 5: Probabilistic Queries and Uncertain Data

Application: Sensor databases

[Figure: sensors monitor the external environment (e.g., temperature, moving objects, hazardous materials) and send readings over a network channel to the database system; users send queries to the database system and receive results.]

Page 6: Probabilistic Queries and Uncertain Data

Data uncertainty

Due to limited network bandwidth and battery power, readings are sampled

The value of the entity being monitored (e.g., temperature, location) is changing

Most of the time the database stores old values

Query results can be incorrect!

Page 7: Probabilistic Queries and Uncertain Data

Answering a Minimum Query

Database: X
Correct answer: Y

[Figure: recorded temperatures (x0, y0) and current temperatures (x1, y1) of sensors x and y, plotted on a 0-30 °F axis.]

Page 8: Probabilistic Queries and Uncertain Data

Bounding Uncertainty with Dead-Reckoning

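Data values cannot change drastically
The system negotiates a bound d with the sensor

[Figure: the sensor reports (v, d) to the system; the system assumes the current value lies in [v-d, v+d] around the reported value v.]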

Trade-off between data uncertainty and update frequency
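The trade-off can be illustrated with a small simulation. This is a minimal sketch, assuming the common dead-reckoning policy in which the sensor sends a new value only when its reading drifts more than d from the last reported value v. The class `Sensor`, the function `count_updates`, and the sample readings are hypothetical and are not taken from the tutorial.

```python
class Sensor:
    """Hypothetical sensor that reports only when its reading leaves [v - d, v + d]."""

    def __init__(self, initial_value, d):
        self.reported = initial_value  # last value v sent to the database
        self.d = d                     # negotiated bound
        self.updates_sent = 1          # the initial report counts as one update

    def observe(self, reading):
        """Return True if the new reading forced an update to the database."""
        if abs(reading - self.reported) > self.d:
            self.reported = reading
            self.updates_sent += 1
            return True
        return False

    def bound(self):
        """Interval the database may assume for the current value."""
        return (self.reported - self.d, self.reported + self.d)


def count_updates(readings, d):
    """Replay a sequence of readings and count how many updates were sent."""
    sensor = Sensor(readings[0], d)
    for reading in readings[1:]:
        sensor.observe(reading)
    return sensor.updates_sent


readings = [20.0, 20.4, 21.1, 22.3, 22.0, 19.5, 19.9]
tight = count_updates(readings, d=0.5)  # narrow interval, more updates
loose = count_updates(readings, d=2.0)  # wide interval, fewer updates
```

On these readings the tighter bound sends more updates than the looser one, in exchange for a narrower uncertainty interval in the database.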

Page 9: Probabilistic Queries and Uncertain Data

Answering Minimum Query with Error-Bounded Readings

x certainly gives the minimum temperature reading

[Figure: recorded temperatures x0 and y0 of sensors x and y, each with a bound for the current temperature, on a 0-30 °F axis.]

Page 10: Probabilistic Queries and Uncertain Data

Answering Minimum Query with Error-Bounded Readings

[Figure: recorded temperatures x0 and y0 of sensors x and y, each with a bound for the current temperature and an uncertainty pdf over that bound, on a 0-30 °F axis.]

How do we determine the answer to this query?

Each sensor has some chance of giving the minimum reading.

Probabilistic Queries

Page 11: Probabilistic Queries and Uncertain Data

Probabilistic Queries

As attribute values become uncertain (actually, imprecise), operators (e.g., =, <, >) over these data need to be defined.

These operators may no longer return Boolean results. Instead, given the probability distributions, they can return probabilistic answers.

Page 12: Probabilistic Queries and Uncertain Data

Answering Minimum Query with Error-Bounded Readings

[Figure: recorded temperatures x0 and y0 of sensors x and y, each with a bound for the current temperature, on a 0-30 °F axis.]

((X, 0.7), (Y, 0.3)): answers augmented with probabilistic guarantees
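How such probabilities can arise is sketched below, assuming each sensor's current value is uniformly distributed over its bound and the two sensors are independent. The function names and the example bounds are hypothetical; they are not the bounds shown in the figure, so the numbers differ from the 0.7 / 0.3 on the slide.

```python
def uniform_cdf(v, lo, hi):
    """P(value <= v) for a value uniformly distributed on [lo, hi]."""
    if v <= lo:
        return 0.0
    if v >= hi:
        return 1.0
    return (v - lo) / (hi - lo)


def prob_x_is_min(x_bound, y_bound, steps=10_000):
    """P(X < Y) for independent uniform X and Y, by midpoint integration over X."""
    lo, hi = x_bound
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        # Each slice of X carries probability 1/steps; Y must exceed x.
        total += (1.0 - uniform_cdf(x, *y_bound)) / steps
    return total


p_x = prob_x_is_min((10.0, 20.0), (15.0, 25.0))
answer = (("X", p_x), ("Y", 1.0 - p_x))
```

When the two bounds do not overlap, the same function returns 1.0 for the lower sensor, which is the "certain" case of the earlier slide.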

Page 13: Probabilistic Queries and Uncertain Data

Sensor Errors

In the previous examples, uncertainty was introduced in order to avoid incorrect results

Uncertainty may be inherent due to measurement errors, e.g.
  Most scientific instruments have well-known errors
  GPS errors follow a Gaussian distribution
  Micro-array data errors follow a Lorentzian distribution
  Statistical results also have margins of error

Similar to previous case

Page 14: Probabilistic Queries and Uncertain Data

Data Privacy

Uncertainty may sometimes be desirable in order to provide privacy for individuals.

Instead of reporting an exact location to a Location-Based service provider, users can obfuscate their location to a small spatial region.

This naturally results in ambiguity (uncertainty) in query results.

Page 15: Probabilistic Queries and Uncertain Data

Application: Protein Annotation

Consider a protein database that records the functions of the proteins (annotations).
Some function information is experimentally derived and has high confidence (certainty).
More often, annotations are transferred based upon computational results
  HMMs
  Sequence similarity
  Rule bases
Such annotations are inherently less reliable.
As these annotations propagate, so do the errors.
It is desirable to be able to capture the uncertainties in the annotations within the database.

Page 16: Probabilistic Queries and Uncertain Data

Application: Text Retrieval

In text retrieval systems, answers to queries are typically inexact.

For example, “Find documents on uncertain data management”

Results are ranked in order of relevance to the query

Thus, the answer can be viewed as having a probability of being part of the result relation

When multiple conditions are tested -- how do we combine these rankings?

Probabilistic modeling can help in this situation.

Page 17: Probabilistic Queries and Uncertain Data

Application: Data Integration & Cleaning

When integrating multiple databases, it is necessary to identify matches between tuples

For many pairs, there is no clear Yes/No answer to the matching question

Existing methods can provide a probability or degree of match which can be exploited in an application-specific manner.

How should these uncertainties in the result of cleaning or integration be handled?

Page 18: Probabilistic Queries and Uncertain Data

Unreliable Sources, Missing Data

Consider the following cases:

Information received from certain sources may not be entirely reliable (compromised sensors, poor quality of data, …).

Information from multiple sources may be inconsistent, even contradictory.

An attribute’s exact value may not be known, but it can be only one of a few possibilities.

Each of these cases is an example where the data is uncertain.

Page 19: Probabilistic Queries and Uncertain Data

Application Needs

In summary, we see that there are numerous applications for which uncertainty in data is either inherent or desirable.

Existing systems do not provide any support for uncertain data thereby compelling applications to morph their data to fit the model.

There is a real need for the development of database systems that handle uncertain data.

The characteristics of uncertainty are diverse and often application-dependent.

Page 20: Probabilistic Queries and Uncertain Data

Outline

Motivating examples
Proposed Models
Implementation issues
  Efficiency
  Scalability
  Prototypes
Open problems
References

Page 21: Probabilistic Queries and Uncertain Data

Uncertain Data Models

There have been numerous proposals for models. Some distinguishing features include:
  Nature of uncertainty (probabilistic, …)
  Types of databases (Relational, XML, …)
  Complexity of uncertainty
    Granularity of uncertainty
    Handling correlations
    Handling missing data
    Types of uncertainty supported

Page 22: Probabilistic Queries and Uncertain Data

Types of uncertainty models

Qualitative models
  NULL values
  Definite, Indefinite, or Maybe [LS87,LS91]

Quantitative models
  Probabilistic
  Dempster-Shafer (evidence-based) [LSS96, Lee92]
  Fuzzy sets (possibilities) [GUP06]

Page 23: Probabilistic Queries and Uncertain Data

Probabilistic Models

There are two main types of probabilistic data uncertainty addressed in recent work:

Attribute uncertainty
  The value of an attribute of a tuple is not known precisely
  Modeled as a set or range of possible values with associated probabilities

Tuple uncertainty
  The membership (presence) of an entire tuple within a relation is uncertain
  May be modeled as a probability attached to the tuple.

Page 24: Probabilistic Queries and Uncertain Data

Other Models

Some systems consider both types ([GUP06])
Table uncertainty has also been proposed to handle coverage of a table (what percentage of tuples are present in the table) [Wid05].
Probabilistic database in semi-structured model
  XML data (Nierman & Jagadish) [NJ02]
  Acyclic data structure (Hung, Getoor & Subrahmanian) [HGS03]
Fuzzy databases [GUP06] (possibility values)
Uncertainty in Deductive Databases [LS97,LS01,LS03]

Page 25: Probabilistic Queries and Uncertain Data

Tuple Uncertainty

There has been a significant amount of work in this domain dating back (at least) to 1979.

The basic idea is that the membership of a tuple in a relation is not certain.

This uncertainty may reflect the degree of confidence that this tuple belongs to the relation or the degree of relevance of the tuple to the relation (a query answer).

Page 26: Probabilistic Queries and Uncertain Data

Some Tuple Uncertainty Models

Cavallo and Pittarelli [CP87]
Fuhr and Roelleke [FR97]
Fuhr [Fuhr95]
Dey and Sarkar [DS96]
TRIO [Wid05]

Page 27: Probabilistic Queries and Uncertain Data

Fuhr [FR97,Fuhr90,Fuhr95]

Input relations are assumed to have attributes that have probabilistic events associated with them. These are assumed to be independent.
The evaluation of queries results in new tuples with complex events associated with them.
These tuples may no longer be independent, thus causing complications.
Fuhr solves this problem using intensional semantics -- for each tuple, the complex event is derived. In the final step the probability value of this event is computed.
This is very expensive and complicated.

Page 28: Probabilistic Queries and Uncertain Data

Dalvi & Suciu [DS04, DS05]

Dalvi and Suciu explore extensional evaluations -- the probability values of tuples after the application of operators are computed.

However, this can lead to incorrect results in some cases. Notion of safe query plans.

An algorithm to identify a safe extensional plan for a query is developed. May not always return a result.

Heuristic plans and approximations are proposed for the case where the data complexity of the query is #P-complete.

[DS05] addresses the case where input relation tuples are not independent.
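Why an extensional plan can be wrong, and why a safe plan is not, can be seen on a tiny instance. The sketch below is an illustration under stated assumptions rather than the algorithm of [DS04]: the relations `R` and `S`, their probabilities, and the Boolean query "is there an x, y with R(x) and S(x, y)?" are made up, tuples are assumed independent, and the ground truth is obtained by enumerating all possible worlds.

```python
from itertools import product

# Hypothetical tuple-independent relations: tuple -> membership probability.
R = {"a": 0.5}
S = {("a", "b1"): 0.5, ("a", "b2"): 0.5}


def possible_worlds_answer():
    """Ground truth: sum the weights of all worlds in which the query holds."""
    tuples = [("R", k, p) for k, p in R.items()] + [("S", k, p) for k, p in S.items()]
    answer = 0.0
    for present in product([True, False], repeat=len(tuples)):
        weight = 1.0
        r_world, s_world = set(), set()
        for (rel, key, p), here in zip(tuples, present):
            weight *= p if here else 1.0 - p
            if here:
                (r_world if rel == "R" else s_world).add(key)
        if any(x in r_world for (x, _y) in s_world):
            answer += weight
    return answer


def independent_or(probs):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def unsafe_plan():
    """Join first, then project: wrongly treats the joined tuples as independent."""
    joined = [R[x] * p for (x, _y), p in S.items() if x in R]
    return independent_or(joined)


def safe_plan():
    """Project S onto x first, then join with R: operands stay independent."""
    per_x = {}
    for (x, _y), p in S.items():
        per_x.setdefault(x, []).append(p)
    return independent_or([R[x] * independent_or(ps) for x, ps in per_x.items() if x in R])
```

In the unsafe plan both joined tuples depend on the same R tuple, so combining them as independent events overestimates the answer; the safe plan agrees with the possible-worlds value.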

Page 29: Probabilistic Queries and Uncertain Data

Information Source Tracking

Fereidoon Sadri [FS91, FS95]
Sources of data are assigned a reliability
Query answers and derived data are also assigned a score that can be computed
Each tuple is assigned a propositional formula that describes its certainty (in terms of the reliability of sources) -- vectors
Sources are assumed to be independent
Computing a query implies computing the vectors for each tuple and then computing the corresponding certainty -- requires certainty of sources

Page 30: Probabilistic Queries and Uncertain Data

Information Source Tracking (Cont.)

Possible worlds semantics: k sources, 2^k possible relations
Provided definitions of extended operators that guaranteed soundness and completeness, i.e., the result of these operators over uncertain relations had the same set of possible worlds as applying regular relational operators over the possible worlds of the input relations
Efficiency concerns due to large size of pwd.
Algorithms for aggregations also developed, but mostly expensive or NP-Complete

Page 31: Probabilistic Queries and Uncertain Data

Attribute Uncertainty

The earliest example of work in this area is the notion of NULL values (Codd)

The probabilistic data model (PDM) proposed in [BGP92] -- focus on discrete values

ProbView [LLR+97]

Continuous attribute case proposed for sensor data [CKP03]

Page 32: Probabilistic Queries and Uncertain Data

Codd’s model for uncertainty

NULL values are a means of capturing uncertainty with three-valued logic (T,F,M)

A-mark and I-mark also introduced along with a four-valued logic (T, F, A, I)

A-mark implies that the attribute value exists, but is not known.

I-mark implies that the attribute value is undefined, or does not exist.

Page 33: Probabilistic Queries and Uncertain Data

Probabilistic Data Model

Barbara, Garcia-Molina, Porter [BGP92]
Discrete attribute uncertainty
Key attributes are deterministic (precise)
Notion of attribute groups (handles dependent data)
Captures missing probability (no assumption)
Probabilities may be user defined, statistically determined, due to staleness, etc.

STUDENT  GPA  INTEREST      ACC_EVAL
Adam     3.8  0.7 [theory]  0.6 [Y A]
              0.3 [*]       0.1 [N A]
                            0.3 [* *]

Page 34: Probabilistic Queries and Uncertain Data

Probabilistic Data Model (cont.)

Selects can refer to attributes or probabilities
Selection conditions specify cut-off probabilities
Two flavors -- must and maybe (with or without the missing probability)
  SELECT APPLICANTS WHERE ACC_EVAL: V = [Y, *], P > 0.7 (Adam not in result -- Must semantics)
  SELECT APPLICANTS WHERE ACC_EVAL: v = [Y, *], p > 0.7 (Adam in result -- Maybe semantics)
Natural joins allowed where join attribute must be key for one of the relations (not commutative)
Project similarly defined for dropping attributes from groups
Studied impact of missing probabilities on joins -- may lead to loss of information.
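The difference between the two flavors can be checked against Adam's ACC_EVAL distribution from the previous slide (0.6 [Y A], 0.1 [N A], 0.3 [* *]). The sketch below is a simplified reading of the must/maybe semantics, not the operator definition from [BGP92]: the function names are hypothetical, and it assumes that `*` in the selection pattern matches any value and that an all-`*` entry is the missing probability.

```python
# Adam's ACC_EVAL attribute group: (probability, (ACC, EVAL)).
ADAM_ACC_EVAL = [
    (0.6, ("Y", "A")),
    (0.1, ("N", "A")),
    (0.3, ("*", "*")),  # missing probability: no assumption about the value
]


def matches(value, pattern):
    """True if every non-wildcard position of the pattern equals the value."""
    return all(want == "*" or got == want for got, want in zip(value, pattern))


def is_missing(value):
    """An all-wildcard entry stands for the missing probability."""
    return all(v == "*" for v in value)


def select(distribution, pattern, threshold, maybe=False):
    """Must semantics ignores the missing probability; maybe semantics adds it."""
    prob = sum(p for p, v in distribution if not is_missing(v) and matches(v, pattern))
    if maybe:
        prob += sum(p for p, v in distribution if is_missing(v))
    return prob > threshold


adam_must = select(ADAM_ACC_EVAL, ("Y", "*"), 0.7)               # 0.6 is not above 0.7
adam_maybe = select(ADAM_ACC_EVAL, ("Y", "*"), 0.7, maybe=True)  # 0.6 + 0.3 is above 0.7
```

This reproduces the outcome stated on the slide: Adam is excluded under must semantics and included under maybe semantics.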

Page 35: Probabilistic Queries and Uncertain Data

Probabilistic Data Model (contd.)

New operators:
  -SELECT, -Join: Based upon similarity of probability distributions
  STOCHASTIC: convert regular relation to probabilistic based upon given schema (freq gives probability)
  DISCRETE: convert probabilistic relation to a regular relation (based upon expected values)
  GROUP: merge two or more attribute groups into one

Page 36: Probabilistic Queries and Uncertain Data

ProbView [LLR+97]

Attribute values specified as alternative discrete values with probability intervals.
Attribute uncertainty is converted to tuple uncertainty.
Possible worlds are derived from this set with upper and lower bounds on probabilities.
Annotated relations obtained by flattening probabilistic relations with path (expressions on worlds)
Computing probabilities for queries is done via user-specified functions.
Relational algebra operations are extended to handle the probability bounds and paths.

Page 37: Probabilistic Queries and Uncertain Data

Continuous Attribute Uncertainty

Cheng, Kalashnikov, Prabhakar [CPK03a, CKP04]
Allow an attribute value to be a continuous range with an associated probability density function
The cumulative probability over the interval should be 1
General continuous attribute uncertainty model
Covers models used in various application domains, e.g.,
  location uncertainty [WSCY99, PJ99]
  DNA microarray data error [BWW+02]

[Figure: an uncertainty interval [Li, Ri] with the uncertainty pdf fi(x) defined over it.]

Page 38: Probabilistic Queries and Uncertain Data

Probabilistic Nearest Neighbor Query

At distance r, A is the nearest neighbor of Q if:
  A is at distance r from Q
  B, C, D are all located at distances > r from Q.
The pdf pA(r) can be computed.

[Figure: uncertain objects A, B, C, D and a distance r from the query point.]

Page 39: Probabilistic Queries and Uncertain Data

Probabilistic Nearest Neighbor Query

Compute pA(r)
  From the shortest distance of A to Q (nA)
  To the longest distance of A to Q (fA)

$$P_A = \int_{n_A}^{f_A} p_A(r)\,dr$$

[Figure: query point Q and uncertain objects A, B, C, D, with the distances nA and fA marked.]
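The integral of pA(r) from nA to fA can be approximated numerically. The sketch below is a one-dimensional simplification made for illustration, not the evaluation method of the cited work: objects are assumed to be independent and uniformly distributed over intervals on a line, and all function names and the sample intervals are hypothetical. For each thin slice of distance it multiplies the probability that A lies at that distance by the probability that every other object is farther away.

```python
def distance_cdf(interval, q, r):
    """P(|X - q| <= r) for X uniform on interval = (lo, hi)."""
    lo, hi = interval
    overlap = min(hi, q + r) - max(lo, q - r)
    return max(0.0, overlap) / (hi - lo)


def nearest_and_farthest(interval, q):
    """Shortest (n) and longest (f) possible distance from q to the object."""
    lo, hi = interval
    nearest = 0.0 if lo <= q <= hi else min(abs(lo - q), abs(hi - q))
    farthest = max(abs(lo - q), abs(hi - q))
    return nearest, farthest


def nn_probabilities(intervals, q, steps=2000):
    """Approximate, per object, the probability of being the nearest neighbor of q."""
    probs = []
    for a, iv_a in enumerate(intervals):
        n_a, f_a = nearest_and_farthest(iv_a, q)
        dr = (f_a - n_a) / steps
        total = 0.0
        for i in range(steps):
            r0 = n_a + i * dr
            r1 = r0 + dr
            # Probability that A's distance falls in the slice [r0, r1].
            mass = distance_cdf(iv_a, q, r1) - distance_cdf(iv_a, q, r0)
            # Probability that every other object is farther than the slice midpoint.
            r_mid = 0.5 * (r0 + r1)
            others_farther = 1.0
            for b, iv_b in enumerate(intervals):
                if b != a:
                    others_farther *= 1.0 - distance_cdf(iv_b, q, r_mid)
            total += mass * others_farther
        probs.append(total)
    return probs


objects = [(1.0, 3.0), (2.0, 5.0), (6.0, 8.0)]  # hypothetical A, B, C
probs = nn_probabilities(objects, q=0.0)
```

The third object can never be nearest because its shortest distance exceeds the longest distance of the first, and the three probabilities sum to approximately 1.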

Page 40: Probabilistic Queries and Uncertain Data

Classification of Probabilistic Results

Four classes of queries identified [CKP03b]

1. Nature of result values
  Continuous: returns a single value, e.g., Average query ([l,u], pdf)
  Discrete: returns a set of objects, e.g., Range query ({(Ti,pi), pi>0})
2. Relationship between result values
  Independent: whether an object satisfies a query is independent of others, e.g., Range query
  Interdependent: interplay between objects decides result, e.g., Nearest-Neighbor query

Page 41: Probabilistic Queries and Uncertain Data

Classification of Probabilistic Queries

                 Continuous                                       Discrete
Independent      What is the temperature of sensor x?             Which sensor has temp between 10 °F and 30 °F?
Inter-dependent  What is the average temperature of the sensors?  Which sensor gives the highest temperature?

The notion of query answer quality was also introduced.
For each class of queries, a metric for query quality was specified.
Intuitively, this metric captures the degree of uncertainty in the answer (as compared to an answer derived over precise data).

Page 42: Probabilistic Queries and Uncertain Data

Quality of Probabilistic Result

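Probabilistic queries: notion of result "quality"
Example: range query (is Ti.z in range [l, u]?)
  regular range query: "yes" or "no"
  probabilistic range query: probability p_i, scored as

$$\text{Score} = \frac{|p_i - 0.5|}{0.5}$$

$$\text{Score of an ERQ} = \frac{1}{|R|} \sum_{i \in R} \frac{|p_i - 0.5|}{0.5}$$

[Figure: three cases a), b), c) of an uncertainty interval relative to the range [l, u].]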
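The range-query score averages |p_i - 0.5| / 0.5 over the result set R, so it is 1 when every answer is certain (p_i is 0 or 1) and 0 when every answer is a coin flip. A minimal sketch, with the hypothetical function name `erq_score` and made-up probabilities:

```python
def erq_score(probabilities):
    """Average of |p_i - 0.5| / 0.5 over the result set R."""
    if not probabilities:
        raise ValueError("the result set R must not be empty")
    return sum(abs(p - 0.5) / 0.5 for p in probabilities) / len(probabilities)


certain = erq_score([1.0, 1.0])    # every object surely in range: best quality
ambiguous = erq_score([0.5, 0.5])  # every object a coin flip: worst quality
mixed = erq_score([0.9, 0.6])      # (0.8 + 0.2) / 2
```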

Page 43: Probabilistic Queries and Uncertain Data

Quality for Continuous-Interdependent Queries

Query result: [l,u], {p(x) : x ∈ [l,u]}
U[3,4] less ambiguous than U[1,100]
Differential entropy

$$H(X) = -\int_{l}^{u} p(x) \log_2 p(x)\,dx$$

Measures uncertainty associated with r.v. X with pdf p
max(H(X)) = log2(u-l) iff X~U[l,u] (most uncertain)

$$\text{Score of Value Aggr Query} = -H(X)$$

Metrics for other classes also proposed.
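The entropy-based score can be evaluated numerically for any pdf on [l, u]. The sketch below uses midpoint integration; the function names and the triangular pdf are hypothetical examples. It reproduces the comparison from the slide: U[3,4] has entropy 0, while U[1,100] has entropy log2(99), so the narrower answer gets the higher score -H(X).

```python
import math


def differential_entropy(pdf, l, u, steps=10_000):
    """Approximate H(X) = -integral of p(x) * log2 p(x) over [l, u]."""
    dx = (u - l) / steps
    h = 0.0
    for i in range(steps):
        p = pdf(l + (i + 0.5) * dx)
        if p > 0.0:  # p * log2(p) tends to 0 as p tends to 0
            h -= p * math.log2(p) * dx
    return h


def score(pdf, l, u):
    """Quality score: negated differential entropy (higher is better)."""
    return -differential_entropy(pdf, l, u)


def triangular(x):
    """Hypothetical non-uniform pdf on [0, 2] with its peak at 1."""
    return x if x <= 1.0 else 2.0 - x


h_narrow = differential_entropy(lambda x: 1.0, 3.0, 4.0)          # U[3,4]
h_wide = differential_entropy(lambda x: 1.0 / 99.0, 1.0, 100.0)   # U[1,100]
h_tri = differential_entropy(triangular, 0.0, 2.0)                # below log2(2)
```

The triangular example illustrates the stated maximum: on the same interval [0, 2] its entropy stays below that of the uniform pdf, log2(2) = 1.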

Page 44: Probabilistic Queries and Uncertain Data

Outline

Motivating examples
Proposed Models
Implementation issues
  Efficiency
  Scalability
  Prototypes
Open problems
References

Page 45: Probabilistic Queries and Uncertain Data

Implementation Challenges

Many proposals have not addressed the issues of implementation

Some models are known to be very expensive computationally, e.g. the model proposed in [FR97].

Is it possible to avoid enumeration of all possible worlds in order to compute queries?

Notion of safe queries and extensional evaluation [DS04].

Page 46: Probabilistic Queries and Uncertain Data

Extensional Semantics [DS04]

Intensional evaluation is very expensive.
Propose new extensional evaluation where probabilities are continuously maintained.
Can lead to incorrect results -- develop the notion of safe extensional plans based upon PWD semantics.
Extensional plans not always available.
Some heuristics have been proposed. Can one do better?
Work done in the context of queries with uncertain predicates (information retrieval). What about other domains?

Page 47: Probabilistic Queries and Uncertain Data

Orion Query Evaluation [CKP03]

Probabilistic Range Query example

[Figure: recorded temperatures of sensors T1 and T2, each with an uncertainty interval for the current temperature, on a 0-30 °F axis.]

$$p_1 = \int_{10}^{12} f_1(z)\,dz$$

$$p_2 = \int_{15}^{25} f_2(z)\,dz$$

Result: {(T1, 0.2), (T2, 0.8)}

Page 48: Probabilistic Queries and Uncertain Data

Probabilistic Threshold Range Query (PTRQ)

Users are likely to be concerned with results that meet a given cutoff probability.
Retrieve sensor ids with readings between 10 °F and 25 °F with probability ≥ 0.7
PTRQ: Given [a,b] and p, return {Ti} where Prob(value of Ti is inside [a,b]) ≥ p
How to exploit indexes for such queries?
  Use R-tree or interval index [AV96, KRVV96, MTT00] to find intervals intersecting [a,b]
  For each object retrieved, evaluate its probability of being within [a,b]. Return objects with probability ≥ p
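The two steps, filter by interval intersection and then refine by probability, can be sketched as follows. This is an illustration under stated assumptions: each object has a uniform pdf over its uncertainty interval, a linear scan stands in for the R-tree or interval index, and the function names and sensor intervals are hypothetical.

```python
def prob_in_range(interval, a, b):
    """P(value in [a, b]) for a value uniformly distributed on interval = (lo, hi)."""
    lo, hi = interval
    overlap = min(hi, b) - max(lo, a)
    return max(0.0, overlap) / (hi - lo)


def ptrq(objects, a, b, p):
    """Return {id: probability} for objects inside [a, b] with probability >= p."""
    # Filter step: keep intervals that intersect [a, b].
    # A real system would answer this with an interval index instead of a scan.
    candidates = {oid: iv for oid, iv in objects.items() if iv[0] <= b and iv[1] >= a}
    # Refine step: evaluate the probability of each candidate and apply the threshold.
    result = {}
    for oid, iv in candidates.items():
        prob = prob_in_range(iv, a, b)
        if prob >= p:
            result[oid] = prob
    return result


sensors = {"T1": (8.0, 18.0), "T2": (12.0, 22.0), "T3": (24.0, 40.0), "T4": (30.0, 35.0)}
answer = ptrq(sensors, a=10.0, b=25.0, p=0.7)
```

In this example T3 survives the filter step but is rejected in the refine step, which is the kind of wasted work the following slides aim to reduce.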

Page 49: Probabilistic Queries and Uncertain Data

Problem with Current Indexes

Current Interval indexes do not consider probabilities during search

Many irrelevant objects (probability < p) may be processed.

New indexes for probabilistic data. Orion [CXP+04]:
  Probability Threshold Indexing (PTI)
    1D interval R-tree with uncertainty
  Variance-based Clustering
    Transform intervals to 2D points and index based on variance

Page 50: Probabilistic Queries and Uncertain Data

Pruning in a 1D R-Tree

[Figure: an MBR of a 1D R-tree and a query Q (p = 0.3) with range [a, b].]

Some intervals in the MBR may satisfy Q
Need to retrieve the contents of the MBR and evaluate

Page 51: Probabilistic Queries and Uncertain Data

x-bounds in a PTI Node

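[Figure: a PTI node with its left-0.2-bound and right-0.2-bound. For every interval i in the node, at most 0.2 of its probability lies to the left of the left-0.2-bound, so at least 0.8 lies to its right.]

$$\int_{L_i}^{\text{left-0.2-bound}} f_i(y)\,dy \le 0.2$$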

Page 52: Probabilistic Queries and Uncertain Data

x-bounds in a PTI Node

[Figure: a PTI node annotated, from left to right, with left-0-bound (MBR), left-0.2-bound, left-0.3-bound, left/right-0.5-bound, right-0.3-bound, right-0.2-bound, and right-0-bound (MBR).]

Page 53: Probabilistic Queries and Uncertain Data

Pruning with x-bounds

[Figure: two cases of a query Q (p = 0.3) with range [a, b] placed relative to a node's left-0.2-bound and right-0.2-bound.]

An MBR is not retrieved if there exists an x-bound such that p > x and b is on the left of the left-x-bound

An MBR is not retrieved if there exists an x-bound such that p > x and a is on the right of the right-x-bound
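The pruning rule follows from the definition of an x-bound: if the query range [a, b] lies entirely to the left of a node's left-x-bound, no interval in the node can fall inside [a, b] with probability more than x, so a threshold p > x cannot be met (and symmetrically on the right). A minimal sketch of the test, assuming a node stores its x-bounds in a dictionary; the layout, the function name `can_prune`, and the sample values are hypothetical:

```python
def can_prune(x_bounds, a, b, p):
    """Decide whether a node can be skipped for a PTRQ with range [a, b] and threshold p.

    x_bounds maps x to (left_x_bound, right_x_bound); x = 0.0 is the MBR itself.
    """
    for x, (left, right) in x_bounds.items():
        # Query entirely left of the left-x-bound or right of the right-x-bound:
        # every interval has probability at most x inside [a, b], and p > x.
        if p > x and (b < left or a > right):
            return True
    return False


node = {0.0: (0.0, 100.0), 0.2: (20.0, 80.0), 0.3: (30.0, 70.0)}

pruned = can_prune(node, a=5.0, b=15.0, p=0.3)     # b < left-0.2-bound and 0.3 > 0.2
kept_low_p = can_prune(node, a=5.0, b=15.0, p=0.1) # threshold too low for any x-bound
kept_overlap = can_prune(node, a=5.0, b=25.0, p=0.3)  # b lies past left-0.2-bound
```

Without the x-bounds, all three queries would intersect the MBR [0, 100] and force the node to be read.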

Page 54: Probabilistic Queries and Uncertain Data

Drawback of PTI

Extra overhead in storing x-bounds
Small intervals near edges limit gains

[Figure: a node with its left-0.2-bound and right-0.2-bound.]

Page 55: Probabilistic Queries and Uncertain Data

Clustering 2D points

Points in the same vicinity have similar means and variances

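[Figure: each interval [Li, Ri] plotted as the 2D point (Li, Ri) on axes x = Li and y = Ri, relative to the line x = y; the figure marks the mean of [Li,Ri], the variance of [Li,Ri], a cluster of large intervals, and a cluster of smaller intervals.]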

When 2D points are clustered, intervals of different variances are separated

Points clustered based on means and variances (variance-based clustering)

Page 56: Probabilistic Queries and Uncertain Data

Answering PTRQ with 2D R-Tree

Construct an R-tree over 2D points transformed from the intervals

Convert PTRQ to a 2D-range query

Query the 2D R-Tree

Page 57: Probabilistic Queries and Uncertain Data

Querying Uniform pdf

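[Figure: 1D view (uniform pdf) of an interval [Li, Ri] against a query Q (p = 0.75) with range [a, b]; 2D view of the point (Li, Ri) on axes x = Li and y = Ri, with the line x = y and the lines through a and b dividing the plane into regions.]

Conditions on the point (x, y) = (Li, Ri) in each region of the 2D view:
  Intervals in [a,b]: a < x < y < b
  Intervals containing a: y(1-p) + xp ≥ a
  Intervals containing b: x(1-p) + yp ≤ b
  Intervals containing [a,b]: b - a ≥ p(y - x)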
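Each linear condition is the threshold test Prob ≥ p rewritten for a uniform pdf; for example, an interval [x, y] that contains only a lies in [a, b] with probability (y - a)/(y - x), and (y - a)/(y - x) ≥ p rearranges to y(1-p) + xp ≥ a. The sketch below compares the region conditions with the direct probability computation. The function names are hypothetical, and the handling of intervals that merely touch [a, b] is an assumption.

```python
def qualifies_directly(x, y, a, b, p):
    """Threshold test from the overlap of a uniform interval [x, y] with [a, b]."""
    overlap = min(y, b) - max(x, a)
    return max(0.0, overlap) / (y - x) >= p


def qualifies_by_region(x, y, a, b, p):
    """Same test expressed as linear conditions on the 2D point (x, y)."""
    if y < a or x > b:            # interval disjoint from [a, b]
        return False
    if a <= x and y <= b:         # interval inside [a, b]: probability 1
        return True
    if x < a and y <= b:          # interval contains a only
        return y * (1 - p) + x * p >= a
    if a <= x and y > b:          # interval contains b only
        return x * (1 - p) + y * p <= b
    return b - a >= p * (y - x)   # interval contains [a, b]


# Query Q with p = 0.75 and a hypothetical range [4, 9].
inside = qualifies_by_region(5, 8, 4, 9, 0.75)       # fully inside the range
contains_a = qualifies_by_region(3, 8, 4, 9, 0.75)   # (8 - 4) / (8 - 3) = 0.8
contains_ab = qualifies_by_region(2, 12, 4, 9, 0.75) # (9 - 4) / (12 - 2) = 0.5
```

Because every condition is linear in (x, y), the qualifying points form a region that a 2D range search over the transformed intervals can retrieve.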

Page 58: Probabilistic Queries and Uncertain Data

Implemented Systems

U. Washington
  Tuple-uncertainty
  Built as a layer over SQLServer 2000
  Evaluation of similarity queries over certain data.

Orion (Purdue)
  Attribute uncertainty
  Extension of PostgreSQL
  Defines new uncertain data types, and operators
  Boolean operations over uncertain data (thresholds)
  http://orion.cs.purdue.edu/

Page 59: Probabilistic Queries and Uncertain Data

Orion Prototype

A system for handling uncertain data
Meta-queries for specifying data uncertainty (e.g., uncertainty interval, type of uncertainty pdf)
Extension of SQL operators to support different probabilistic query classes
Measurement of probabilistic answer quality
Allows easy addition of new uncertain data types (e.g., uncertain pdf) and query operators

Page 60: Probabilistic Queries and Uncertain Data

Example Queries

Create a table with UNCERTAIN type
  CREATE TABLE T(
    k INTEGER PRIMARY KEY,
    a UNCERTAIN);

Insert Gaussian pdf (μ,σ)
  INSERT INTO T VALUES (1, '(g,μ,σ)');

Display uncertain info. of a if a > 5
  SELECT a FROM T WHERE a > 5;

Equality join of uncertain attributes (=% returns probability of equality)
  SELECT R.k, S.k, R.a =% S.a
  FROM R, S
  WHERE R.a = S.a;

Entities with prob. giving min value of a (e.g., {(3,0.5), (5,0.3), (11,0.2)})
  SELECT Emin(T.a) FROM T;

Min value of a for table T (UNCERTAIN)
  SELECT Vmin(T.a) FROM T;

Page 61: Probabilistic Queries and Uncertain Data

Outline

Motivating examples
Proposed Models
Implementation issues
  Efficiency
  Scalability
  Prototypes
Open problems
References

Page 62: Probabilistic Queries and Uncertain Data

Models

A large number of models have been proposed. Some are subsumed by others.
Still unclear which is the best model (if any).
What model should be used for what applications?
What is the nature of uncertainty for important classes of applications?
Which model(s) are applicable?
Mapping model to user notions.

Page 63: Probabilistic Queries and Uncertain Data

Model issues

Models
  What types of uncertainty does a model provide?
  Is the model complete? Closed?
  Query semantics for a given model
  How to handle missing data? Correlations?
  Models for specific domains?
  User interpretation and understandability.

Page 64: Probabilistic Queries and Uncertain Data

Implementation issues

How should uncertainty be represented in the system?
Efficient algorithms for query evaluation.
  Operators over uncertain data.
  New types of queries.
  Index structures for uncertain data.
Query optimization
  Should we approximate?
  Threshold queries?
How should probabilities (uncertainties) be attached to data?
Query language extensions.
User-interfaces -- how can users understand and control the impact of uncertainty?

Page 65: Probabilistic Queries and Uncertain Data

References

[AV96] L. Arge and J. S. Vitter. On dynamic interval management in external memory (extended abstract). In FOCS, p. 560-569, 1996.
[BGP92] D. Barbara, H. Garcia-Molina and D. Porter. The management of probabilistic data. IEEE TKDE, 4(5):487-502, 1992.
[BWW+02] J. Brody, B. Williams, B. Wold, and S. Quake. Significance and statistical errors in the analysis of DNA microarray data. Proc. of the National Academy of Sciences, U S A., 2002, 1;99(20).
[CH89] C. Chatfield. The analysis of time series: an introduction. Chapman and Hall, 1989.
[CKP04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. In IEEE TKDE, 2004.
[CKP03b] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In ACM SIGMOD 2003.
[CPK03a] R. Cheng, S. Prabhakar, and D. V. Kalashnikov. Querying imprecise data in moving object environments. In IEEE ICDE 2003.
[CP04] R. Cheng and S. Prabhakar. Using Uncertainty to Provide Privacy-Preserving and High-Quality Location-Based Services. In Workshop on Location Systems Privacy and Control, Mobile HCI’04.
[CXP+04] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB 2004.

Page 66: Probabilistic Queries and Uncertain Data

References

[DGM+04] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein and W. Hong. Model-Driven Data Acquisition in Sensor Networks. In VLDB, 2004.
[DGM05] A. Deshpande, C. Guestrin and S. Madden. Using Probabilistic Models for Data Management in Acquisitional Environments. In CIDR, 2005.
[DS04] N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In VLDB 2004.
[DS05] N. Dalvi and D. Suciu. Answering Queries from Statistics and Probabilistic Views. In VLDB 2005.
[FR97] N. Fuhr and T. Roelleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Transactions on Information Systems, 15(1): 32-66, 1997.
[Fuhr90] N. Fuhr. A Probabilistic Framework for Vague Queries and Imprecise Information in Databases. In VLDB, 1990.
[Fuhr95] N. Fuhr. Probabilistic Datalog Logic for Powerful Retrieval Methods. In Proc. of ACM SIGIR, 1995.
[GUP06] J. Galindo, A. Urrutia, M. Piattini. Fuzzy Databases: Modeling, Design, and Implementation. Idea Group Publishing, ISBN: 1-59140-324-3
[HGS03] E. Hung, L. Getoor and V. S. Subrahmanian. PXML: A Probabilistic Semistructured Data Model and Algebra. In ICDE 2003.

Page 67: Probabilistic Queries and Uncertain Data

References

[JSS94] S. Vrbsky and J.W.S. Liu. Producing approximate answers to set- and single-valued queries. The Journal of Systems and Software, 27(3), 1994.
[KRVV96] P. C. Kanellakis, S. Ramaswamy, D. Vengroff, and J. S. Vitter. Indexing for data models with constraints and classes. In J. Comp. Syst. Sci, 52(3):589-612, 1996.
[KT01] S. Khanna and W.C. Tan. On computing functions with uncertainty. In 20th ACM Symposium on Principles of Database Systems, 2001.
[LCL+04] K.Y. Lam, R. Cheng, B. Liang and J. Chau. Sensor Node Selection for Execution of Continuous Probabilistic Threshold Queries in Wireless Sensor Networks. In VSSN, ACM Multimedia 2004.
[Lee92] S. K. Lee. An extensional relational database model for uncertain and imprecise information. In Proc. of VLDB, 1992.
[LLR+97] L. V. S. Lakshmanan, N. Leone, R. Ross, V. S. Subrahmanian. ProbView: A Flexible Probabilistic Database System. ACM Trans. Database Syst. 22(3): 419-469 (1997)
[LS87] K. C. Liu and R. Sunderraman. An Extension to the Relational Model for Indefinite Databases. Proceedings of the ACM-IEEE Computer Society Fall Joint Computer Conference, Dallas, Texas, Pages 428-435, 1987
[LS91] K.C. Liu and R. Sunderraman. A Generalized Relational Model for Indefinite and Maybe Information. IEEE Transactions on Knowledge and Data Engineering, Vol. 3, No. 1, Pages 65-77, 1991

Page 68: Probabilistic Queries and Uncertain Data

References

[LS97] L. V. S. Lakshmanan, F. Sadri. Uncertain Deductive Databases: A Hybrid Approach. Inf. Syst. 22(8): 483-508 (1997)
[LS01] L. V. S. Lakshmanan, F. Sadri. On a theory of probabilistic deductive databases. TPLP 1(1): 5-42 (2001)
[LS03] L. V. S. Lakshmanan, F. Sadri. On A Theory of Probabilistic Deductive Databases. CoRR cs.DB/0312043 (2003)
[LSS96] Lim, Srivastava, and Shekhar. An Evidential Reasoning Approach to Attribute Value Conflict Resolution in Database Integration. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 5, 1996
[MTT00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for intervals. In Advanced Database Indexing, Kluwer, 2000.
[NJ02] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic Data in XML. In VLDB 2002.
[PJ99] D. Pfoser and C. S. Jensen. Capturing the Uncertainty of Moving-Object Representations. In Proc. of the Sixth International Symposium on Spatial Databases, Hong Kong, July 20-23, 1999, pp. 111-132.
[SWC+98] P. A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao. Querying the uncertain position of moving objects. In Temporal Databases: Research and Practice. 1998.

Page 69: Probabilistic Queries and Uncertain Data

References

[TWZ+02] G. Trajcevski, O. Wolfson, F. Zhang and S. Chamberlain. The Geometry of Uncertainty in Moving Objects Databases. In EDBT 2002. Springer LNCS 2287, pp. 233-250.
[Wid05] J. Widom. Trio: A system for integrated management of data, accuracy and lineage. In CIDR, 2005.
[WSCY99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3), 1999.