A Framework for Data Quality Aware Query Systems. Naiem K. Yeganeh, Mohamed A. Sharaf. School of Information Technology and Electrical Engineering, The University of Queensland

Post on 30-Jan-2016

Page 1: A Framework for Data Quality Aware Query Systems

A Framework for Data Quality Aware Query Systems

Naiem K. Yeganeh, Mohamed A. Sharaf

School of Information Technology and Electrical Engineering

The University of Queensland

Page 2: A Framework for Data Quality Aware Query Systems

Data Quality Aware Query System

Example: Virtual Shop

Google product search returns 91345 results for “Cannon Powershot”.

Is the user going to check all results to find the source of information that best matches their requirements?

How can we help the user find what they really want?

Page 3: A Framework for Data Quality Aware Query Systems

Data Quality Aware Query System

Multiple Sources of Information

In a virtual shop, a user query can be answered from various sources of information (virtual shops).

• The challenge is to find the best source(s) that satisfy the user's requirements on data quality.

• Because Data Quality = Fitness for Use

Page 4: A Framework for Data Quality Aware Query Systems

Data Quality Aware Query System

The Following Questions Should Be Answered

1. How to measure the quality of data for each data source? -> DQ Profiling

2. How to model and capture user-specific data quality preferences? -> User Preferences on DQ

3. How to process a data quality aware query and rank the results so that the data sources that best satisfy the user come first? -> DQ Aware Query Processing

Page 5: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

Data Quality Metrics (Dimensions) [Wang 1996]

Accuracy (Erroneous): Postcode “4107” is typed “4017”

Consistency (Inconsistent): ITEE vs. Information Technology and Electrical Engineering

Completeness (Missing): Students don’t have to declare a major till graduation, so major is missing in most enrolments

Currency (Obsolete): Old phone numbers

Accessibility (Unavailable): Server down, privacy concerns

Reliability & Trust (Uncertainty)

[Wang 1996] R.Y. Wang and D.M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 1996.

Page 6: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

Data Quality Profiling is the measurement of DQ metrics.

DQ Profiling in the literature is considered at different granularities:

Per Data Source

Per Schema Object (relations, attributes)

Per Query (Subsets of a Schema Object)

Page 7: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

Manual assignment of Information Quality to Data Sources [Naumann 1999]

Each data source is given a set of IQ scores, most of which are assigned by users. The figure on this slide shows an example from [Naumann 1999].

S1..S5 are data sources, QCAs are quality correspondence assertions which are assigned by operators. EoY, Rep, … are IQ Metrics, i.e. Ease of Understanding, Reputation, etc.

QCAs are what we now call DQ Profile per Data Source

[Naumann 1999] Naumann, F. and Leser, U. and Freytag, J.C., Quality-driven integration of heterogeneous information systems, VLDB 1999

Page 8: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

Finer grained data quality metric [Mecella 2003]

A tree where each node represents a schema object, e.g. Data Source, Relation, Column, and Data Quality Metric

[Mecella 2003] Mecella, M. and Scannapieco, M. and Virgillito, A. and Baldoni, R. and Catarci, T. and Batini, C., The DaQuinCIS broker: Querying data and their quality in cooperative information systems, Journal on Data Semantics, 2003

Page 9: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

A typical Data Quality Profile

Data quality measurements are stored per schema object.

Such a profile cannot provide valid estimates for selection queries. The quality of information about Canon products on a Sony website may be poor even if the website has high quality data in general.

Item Title   Item Desc       Num Available   Price   Tax   User Comments
S2IS         Canon PShot     8               310     31    -
-            Camera          -               1000    -     Test
S3IS         Canon PShot     -               375     -     -
XL-2         -               4               184     18    -
DMC-TZ5K     Panasonic Lmx   -               340     -     -
DSC-W55      Sony Cshot      2               260     26    -

(- marks a missing value)

Object               Metric         Value
Shop.ItemTitle       Completeness   0.83
Shop.ItemDesc        Completeness   0.83
Shop.NumAvailable    Completeness   0.50
Shop.Price           Completeness   1.00
Shop.Tax             Completeness   0.50
Shop.User Comments   Completeness   0.16
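The completeness values in such a profile can be reproduced by counting the non-missing cells of each column. The following is a minimal sketch in Python, assuming the example shop table is read as six rows with missing cells stored as None. The row layout and the names COLUMNS, rows and completeness are assumptions made for illustration and do not come from the slides.

```python
# Example shop table as read from the slide; None marks a missing cell.
# The exact row layout is an assumption reconstructed from the transcript.
COLUMNS = ["ItemTitle", "ItemDesc", "NumAvailable", "Price", "Tax", "UserComments"]
rows = [
    ("S2IS", "Canon PShot", 8, 310, 31, None),
    (None, "Camera", None, 1000, None, "Test"),
    ("S3IS", "Canon PShot", None, 375, None, None),
    ("XL-2", None, 4, 184, 18, None),
    ("DMC-TZ5K", "Panasonic Lmx", None, 340, None, None),
    ("DSC-W55", "Sony Cshot", 2, 260, 26, None),
]


def completeness(rows, column_index):
    """Fraction of rows whose cell in the given column is not missing."""
    present = sum(1 for row in rows if row[column_index] is not None)
    return present / len(rows)


# One completeness value per schema object (here: per column).
profile = {name: completeness(rows, i) for i, name in enumerate(COLUMNS)}

for name, value in profile.items():
    print(f"Shop.{name:<13} Completeness {value:.2f}")
```

Under this reading, User Comments has one value in six rows, which rounds to 0.17; the slide lists 0.16, presumably truncated.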

Page 10: A Framework for Data Quality Aware Query Systems

Data Quality Profiling

Data Quality Profiling for Selection Queries

B (Brand)   M (Model)   P (Price)   I (Image)
C           S           H           1.Jpg
C           S           H           2.Jpg
C           S           L           -
C           N           H           4.Jpg
S           S           H           5.Jpg
S           S           H           6.Jpg
S           S           H           -
S           S           L           7.Jpg
S           N           H           -
S           N           L           8.Jpg
S           N           L           -

Brand: C = Canon, S = Sony

Model: S = SLR, N = Normal

Price: H = High, L = Low

* Data quality differs between selection queries.

* The naïve approach is to pre-compute each data quality metric for every possible selection condition.

* This is impractical: the search space covers every combination of attribute values.
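The dependence of a quality metric on the selection condition can be illustrated with the camera table of this slide. The sketch below, in Python, computes the completeness of the Image column for a few selections. It assumes the table is read as eleven rows with None for a missing image; the row layout and the function name image_completeness are assumptions for illustration only.

```python
# Camera table from the slide: (brand, model, price, image); None = no image.
# Row layout is an assumption reconstructed from the flattened transcript.
cameras = [
    ("C", "S", "H", "1.jpg"),
    ("C", "S", "H", "2.jpg"),
    ("C", "S", "L", None),
    ("C", "N", "H", "4.jpg"),
    ("S", "S", "H", "5.jpg"),
    ("S", "S", "H", "6.jpg"),
    ("S", "S", "H", None),
    ("S", "S", "L", "7.jpg"),
    ("S", "N", "H", None),
    ("S", "N", "L", "8.jpg"),
    ("S", "N", "L", None),
]


def image_completeness(rows, brand=None, model=None, price=None):
    """Completeness of the Image column over the rows matching the selection.

    An argument left as None means the attribute is unconstrained.
    """
    selected = [
        r for r in rows
        if (brand is None or r[0] == brand)
        and (model is None or r[1] == model)
        and (price is None or r[2] == price)
    ]
    if not selected:
        return None  # undefined for an empty selection
    return sum(1 for r in selected if r[3] is not None) / len(selected)


overall = image_completeness(cameras)                           # whole table
canon = image_completeness(cameras, brand="C")                  # Brand = Canon
sony = image_completeness(cameras, brand="S")                   # Brand = Sony
sony_normal = image_completeness(cameras, brand="S", model="N")

print(overall, canon, sony, sony_normal)
```

Even in this small table the per-table value (7 of 11) differs from the value for Canon rows (3 of 4) and for Sony rows (4 of 7). With three two-valued attributes, each either fixed or unconstrained, there are already 3^3 = 27 conjunctive selection conditions to pre-compute per metric.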

Page 11: A Framework for Data Quality Aware Query Systems

User Data Quality Preferences

Preferences:

Preference as Partial Orders: Multi criteria decision making

Preference queries in Database Systems

Data Quality Aware SQL

Handling Inconsistency

Page 12: A Framework for Data Quality Aware Query Systems

User Data Quality Preferences

Preferences:

User preference is best modelled as sets of partial orders. [Saaty 1996]

E.g. I prefer Tea over Coffee, then I prefer No Sugar over Sugar. Or: I prefer Price over Tax, then I prefer Accuracy over Completeness, etc.

[Saaty 1996] T.L. Saaty. Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. RWS Publications, 1996.

I have the following preference matrix about the quality of the Price attribute:

               Completeness               Accuracy                      Consistency
Completeness   -                          Very much preferred           Slightly preferred
Accuracy       Very much less preferred   -                             A little bit preferred
Consistency    Slightly less preferred    A little bit less preferred   -
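A pairwise preference matrix of this kind can be turned into numeric weights in the style of the Analytic Hierarchy Process. The sketch below, in Python, uses the row geometric-mean method, a common approximation of the AHP priority vector. The numeric intensities 7, 3 and 2 assigned to "very much", "slightly" and "a little bit" are assumptions chosen for illustration; the slides do not state a numeric scale, and the function name ahp_weights is hypothetical.

```python
import math

# Hypothetical mapping from the verbal scale on the slide to AHP intensities.
# The numbers 7, 3 and 2 are assumptions chosen for illustration only.
VERY_MUCH, SLIGHTLY, A_LITTLE = 7.0, 3.0, 2.0

metrics = ["Completeness", "Accuracy", "Consistency"]

# pairwise[i][j] > 1 means metrics[i] is preferred over metrics[j];
# the lower triangle holds the reciprocals ("less preferred").
pairwise = [
    [1.0,           VERY_MUCH,    SLIGHTLY],
    [1 / VERY_MUCH, 1.0,          A_LITTLE],
    [1 / SLIGHTLY,  1 / A_LITTLE, 1.0],
]


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


weights = dict(zip(metrics, ahp_weights(pairwise)))

for metric, weight in weights.items():
    print(f"{metric:<13} {weight:.3f}")
```

With these assumed intensities, Completeness receives the largest weight, followed by Accuracy and then Consistency, which matches the ordering expressed in the matrix.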

Page 13: A Framework for Data Quality Aware Query Systems

User Data Quality Preferences

Preference queries [Govindarajan 2000]

Model User Preference as SQL

SELECT X WHERE Q PREFER P1, P2

SELECT X WHERE Q PREFER maximum WRT p

Problems:

Designed for deductive databases, not suitable for data quality preferences.

Does not utilize the partial order definition of preference.

[Govindarajan 2000] K. Govindarajan, B. Jayaraman, and S. Mantha. Preference Queries in Deductive Databases. New Generation Computing, 2000

Page 14: A Framework for Data Quality Aware Query Systems

Data Quality Aware SQL

A SQL extension to query any metric in the data quality profile as [Column Name.Metric Name] as part of the SQL query formulation. [Yeganeh 2009]

SELECT Title, Price FROM ShopItem WHERE Title.Completeness>0.8

SELECT Title, Title.Accuracy, Price FROM ShopItem ORDER BY Price.Accuracy

ORDER BY Price.Accuracy models a one-dimensional preference indicating that sources with higher price accuracy are preferred. A two-dimensional preference cannot be expressed intuitively this way.

[Yeganeh 2009] Yeganeh, N. and Sadiq, S. and Deng, K. and Zhou, X., Data quality aware queries in collaborative information systems, APWeb 2009
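The effect of the two example queries can be emulated over a set of per-source data quality profiles. The sketch below, in Python, is only an illustration of the idea and not the implementation from [Yeganeh 2009]: the source names, the profile values and the functions filter_sources and order_sources are all hypothetical, and it assumes that each metric is available per source and column.

```python
# Hypothetical per-source DQ profiles: (column, metric) -> value.
# Source names and numbers are invented for illustration.
profiles = {
    "S1": {("Title", "Completeness"): 0.83, ("Price", "Accuracy"): 0.70},
    "S2": {("Title", "Completeness"): 0.95, ("Price", "Accuracy"): 0.90},
    "S3": {("Title", "Completeness"): 0.60, ("Price", "Accuracy"): 0.99},
}


def filter_sources(profiles, column, metric, threshold):
    """Emulates WHERE <column>.<metric> > threshold over source profiles."""
    return [
        source for source, profile in profiles.items()
        if profile.get((column, metric), 0.0) > threshold
    ]


def order_sources(profiles, sources, column, metric):
    """Emulates ORDER BY <column>.<metric>, best source first."""
    return sorted(
        sources,
        key=lambda source: profiles[source].get((column, metric), 0.0),
        reverse=True,
    )


# WHERE Title.Completeness > 0.8
eligible = filter_sources(profiles, "Title", "Completeness", 0.8)
# ORDER BY Price.Accuracy
ranked = order_sources(profiles, eligible, "Price", "Accuracy")

print(eligible, ranked)
```

Sorting by a single metric gives a total order of the sources, which is why a preference over two metrics at once needs a different construct.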

Page 15: A Framework for Data Quality Aware Query Systems

Data Quality Aware SQL

Prioritized preferences (utilizing the concept of preference as a partial order):

E.g. among the sources with the highest data quality, sources with high currency of price are prioritized over sources with high completeness of price.

Hierarchy Clause

SELECT Title AS t, Price AS p, [User Comments] AS u

FROM ShopItem WHERE ...

HIERARCHY(ShopItem)

p OVER (t,u) 7, u OVER (t) 3

HIERARCHY(ShopItem.p)

p.Currency OVER (p.Completeness) 3

Generally: HIERARCHY(a) a.x OVER (a.x',...) n

Why Hierarchy?

Intuitively, humans define preferences as partial orders (pairs).

E.g. I prefer coffee to tea.
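One possible reading of the number n in a.x OVER (a.x') n is an AHP-style intensity of preference. Under that assumption, which the slides do not confirm, the clause p.Currency OVER (p.Completeness) 3 corresponds to the 2x2 pairwise matrix [[1, 3], [1/3, 1]], whose priority vector is (3/4, 1/4). The Python sketch below uses these weights to rank two hypothetical sources by a weighted sum of their Price metrics; the source names, profile values and function names are invented for illustration.

```python
def two_way_weights(intensity):
    """Weights for 'x OVER (y) n' under an AHP-style reading (an assumption).

    The 2x2 matrix [[1, n], [1/n, 1]] has priority vector (n/(n+1), 1/(n+1)).
    """
    return intensity / (intensity + 1.0), 1.0 / (intensity + 1.0)


# p.Currency OVER (p.Completeness) 3, taken from the slide's example.
w_currency, w_completeness = two_way_weights(3)

# Hypothetical Price profiles per source; names and values are invented.
price_profiles = {
    "S1": {"Currency": 0.9, "Completeness": 0.5},
    "S2": {"Currency": 0.6, "Completeness": 1.0},
}


def price_score(profile):
    """Weighted sum of the two Price metrics."""
    return (
        w_currency * profile["Currency"]
        + w_completeness * profile["Completeness"]
    )


ranking = sorted(
    price_profiles,
    key=lambda source: price_score(price_profiles[source]),
    reverse=True,
)

print(w_currency, w_completeness, ranking)
```

Here S1 ranks first although S2 has complete prices, because currency carries three times the weight of completeness.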

Page 16: A Framework for Data Quality Aware Query Systems

Data Quality Aware SQL

Preferences as partial orders can be inconsistent.

For example: I prefer tea to coffee, I prefer coffee to milk, I prefer milk to tea.

Visual feedback helps the user define consistent preferences: the size of the circles represents the weight of an item, and color represents the consistency of preferences (e.g. a darker color means possible inconsistency).

Automatically fix inconsistencies when possible.

[Yeganeh 2010] Yeganeh, N.K. and Sadiq, S., Avoiding Inconsistency in User Preferences for Data Quality Aware Queries, BIS 2010
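An inconsistency of the tea, coffee, milk kind shows up as a cycle when each preference is treated as a directed edge from the preferred item to the less preferred one. The Python sketch below detects such a cycle with a depth-first search. It is a generic illustration and not the method of [Yeganeh 2010]; the function name find_cycle and the pair representation are assumptions.

```python
def find_cycle(preferences):
    """Returns one cycle in the 'preferred over' graph as a list, or None.

    preferences is a list of (better, worse) pairs.
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


inconsistent = [("tea", "coffee"), ("coffee", "milk"), ("milk", "tea")]
consistent = [("tea", "coffee"), ("coffee", "milk")]

print(find_cycle(inconsistent))
print(find_cycle(consistent))
```

A returned cycle tells the user which preferences conflict; removing or reversing any one edge of the cycle would restore a consistent partial order.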

Page 17: A Framework for Data Quality Aware Query Systems

DQ Aware Query Planning

Select the query plan that maximizes the quality of the query results.

How to estimate the quality of each data source for complex queries, i.e. joins, aggregates, etc.?

Considering the quality of service metrics of each source becomes necessary in addition to data quality.

The data quality of joins between different data sources is very hard to compute.

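[Figure: a querying interface submits the query Select * from join A,B,C,D on ... through a communication infrastructure to the sources S1, S2, S3, ..., Sn. Each relation can be served by several sources (A: S3, Si, Sx, ...; B: S5, S9, S1, ...; C: Sk, S4, Sy, ...; D: Sj, Sn, Sb, ...), which yields many possible join plans.]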
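The idea of choosing among possible join plans can be sketched as an enumeration of every assignment of one source per relation, followed by picking the plan with the best estimated quality. The Python sketch below is deliberately naive: it assumes a single quality score per source and estimates the quality of a plan as the product of those scores, which presumes independence between sources. The slides state that join quality is very hard to compute, so this estimate is an assumption for illustration only, as are the candidate sets, the scores and the function names.

```python
from itertools import product

# Hypothetical candidate sources per relation with an estimated quality score.
# Names loosely follow the slide's figure; all numbers are invented.
candidates = {
    "A": {"S3": 0.9, "Si": 0.7},
    "B": {"S5": 0.6, "S9": 0.8},
    "C": {"Sk": 0.95, "S4": 0.5},
}


def plan_quality(plan):
    """Naive estimate: product of per-source scores (assumes independence)."""
    quality = 1.0
    for relation, source in plan.items():
        quality *= candidates[relation][source]
    return quality


def enumerate_plans(candidates):
    """Yields every assignment of one source to each relation."""
    relations = list(candidates)
    for combo in product(*(candidates[r] for r in relations)):
        yield dict(zip(relations, combo))


plans = list(enumerate_plans(candidates))
best = max(plans, key=plan_quality)

print(len(plans), best, round(plan_quality(best), 3))
```

The number of plans is the product of the candidate counts per relation, so exhaustive enumeration only works for small cases; a real planner would need pruning and a better quality estimate.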

Page 18: A Framework for Data Quality Aware Query Systems

Putting it all together

DQAQS: Data Quality Aware Query System – A Data Quality Aware Data Integration System

Data Quality Services (DQS): Services to generate data quality profiles.

Data Quality Agents (DQA): Workers that manage generation and maintenance of data quality profiles.

Data Quality Aware Mediator (DQM): A mediator which is able to comprehend the Data Quality aware SQL and orchestrate the query execution (i.e. Data Quality Aware Query Planning).

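[Figure: DQAQS architecture with sources S1, S2, S3, Data Quality Services (DQS), Data Quality Agents (DQA) and the Data Quality Aware Mediator (DQM), connected through a network / cloud.]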

Page 19: A Framework for Data Quality Aware Query Systems

Questions?