A Framework for Data Quality Aware Query Systems
Naiem K. Yeganeh, Mohamed A. Sharaf
School of Information Technology and Electrical Engineering
The University of Queensland
Data Quality Aware Query System
Example: Virtual Shop
Google product search returns 91345 results for “Canon Powershot”.
Is the user going to check all the results to find the best source of information that matches their own requirements?
How can we help the user find what they really want?
Data Quality Aware Query System
Multiple Sources of Information
In a virtual shop, a user query can be answered from various sources of information (virtual shops).
• The challenge is to find the best source(s) that satisfy the user's requirements on data quality.
• This is because Data Quality = Fitness for Use.
Data Quality Aware Query System
The Following Questions Should Be Answered
1. How to measure the quality of data for each data source? -> DQ Profiling
2. How to model and capture user-specific data quality preferences? -> User Preferences on DQ
3. How to process a data quality aware query and rank the results so that the data sources that best satisfy the user come first? -> DQ Aware Query Processing
Data Quality Profiling
Data Quality Metrics (Dimensions) [Wang 1996]
• Accuracy (Erroneous): Postcode “4107” is typed “4017”
• Consistency (Inconsistent): ITEE vs. Information Technology and Electrical Engineering
• Completeness (Missing): Students don’t have to declare a major until graduation, so major is missing in most enrolments
• Currency (Obsolete): Old phone numbers
• Accessibility (Unavailable): Server down, privacy concerns
• Reliability & Trust (Uncertainty)
[Wang 1996] R.Y. Wang and D.M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 1996.
Data Quality Profiling
Data Quality Profiling is the measurement of DQ metrics.
DQ Profiling in the literature is considered at different granularities:
• Per Data Source
• Per Schema Object (relations, attributes)
• Per Query (Subsets of a Schema Object)
Data Quality Profiling
Manual assignment of Information Quality to Data Sources [Naumann 1999]
Each data source is assigned a set of IQ scores, which are mostly assigned by users. The figure below shows an example from [Naumann 1999].
S1..S5 are data sources; QCAs are quality correspondence assertions, which are assigned by operators. EoY, Rep, … are IQ Metrics, i.e. Ease of Understanding, Reputation, etc.
QCAs are what we now call a DQ Profile per Data Source.
[Naumann 1999] Naumann, F. and Leser, U. and Freytag, J.C., Quality-driven integration of heterogeneous information systems, VLDB 1999
Data Quality Profiling
Finer-grained data quality metrics [Mecella 2003]
A tree where each node represents a schema object, e.g. Data Source, Relation, Column, and a Data Quality Metric
[Mecella 2003] Mecella, M. and Scannapieco, M. and Virgillito, A. and Baldoni, R. and Catarci, T. and Batini, C., The DaQuinCIS broker: Querying data and their quality in cooperative information systems, Journal of Data Semantics, 2003
Data Quality Profiling
A typical Data Quality Profile
• Data Quality measurements are stored per schema object.
• Such a profile is unable to provide valid estimates for selection queries.
• The quality of information about Canon products on a Sony website may not be good even if the website has high quality data in general.
Item Title | Item Desc     | Num Available | Price | Tax | User Comments
S2IS       | Canon PShot   | 8             | 310   | 31  |
           | Camera        |               | 1000  |     | Test
S3IS       | Canon PShot   |               | 375   |     |
XL-2       |               | 4             | 184   | 18  |
DMC-TZ5K   | Panasonic Lmx |               | 340   |     |
DSC-W55    | Sony Cshot    | 2             | 260   | 26  |
Object             | Metric       | Value
Shop.ItemTitle     | Completeness | 0.83
Shop.ItemDesc      | Completeness | 0.83
Shop.NumAvailable  | Completeness | 0.50
Shop.Price         | Completeness | 1.00
Shop.Tax           | Completeness | 0.50
Shop.User Comments | Completeness | 0.16
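A minimal sketch of how a per-column completeness profile like the one above could be computed. The row contents are an assumed reading of the sample Shop table, with None marking a missing value, and the function name completeness_profile is hypothetical. Completeness is taken here as the fraction of non-missing values in a column.

```python
# Assumed reading of the sample Shop table; None marks a missing value.
COLUMNS = ["ItemTitle", "ItemDesc", "NumAvailable", "Price", "Tax", "UserComments"]
ROWS = [
    ("S2IS", "Canon PShot", 8, 310, 31, None),
    (None, "Camera", None, 1000, None, "Test"),
    ("S3IS", "Canon PShot", None, 375, None, None),
    ("XL-2", None, 4, 184, 18, None),
    ("DMC-TZ5K", "Panasonic Lmx", None, 340, None, None),
    ("DSC-W55", "Sony Cshot", 2, 260, 26, None),
]


def completeness_profile(columns, rows):
    """Return {column: fraction of rows with a non-missing value}."""
    total = len(rows)
    return {
        col: sum(row[i] is not None for row in rows) / total
        for i, col in enumerate(columns)
    }


profile = completeness_profile(COLUMNS, ROWS)
for col, value in profile.items():
    print(f"Shop.{col} Completeness {value:.2f}")
```

Under this reading, User Comments is filled in 1 of 6 rows, so rounding gives 0.17 where the slide shows 0.16, which suggests the slide truncates.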
Data Quality Profiling
Data Quality Profiling for Selection Queries
B (Brand) | M (Model) | P (Price) | I (Image)
C         | S         | H         | 1.Jpg
C         | S         | H         | 2.Jpg
C         | S         | L         |
C         | N         | H         | 4.Jpg
S         | S         | H         | 5.Jpg
S         | S         | H         | 6.Jpg
S         | S         | H         |
S         | S         | L         | 7.Jpg
S         | N         | H         |
S         | N         | L         | 8.Jpg
S         | N         | L         |
Brand: C = Canon, S = Sony
Model: S = SLR, N = Normal
Price: H = High, L = Low
* Data quality is different for different selection queries.
* The naïve approach is to pre-compute each data quality metric for every possible selection condition.
* The search space of possible selection conditions is too large to enumerate exhaustively.
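To illustrate why data quality differs per selection query, here is a small sketch that recomputes the completeness of the Image column for a few conditions. The rows are transcribed from the Brand/Model/Price/Image table, with None for a missing image; the helper name image_completeness is hypothetical.

```python
# Rows transcribed from the Brand/Model/Price/Image table; None = missing image.
FIELDS = ("B", "M", "P", "I")
ROWS = [
    ("C", "S", "H", "1.Jpg"),
    ("C", "S", "H", "2.Jpg"),
    ("C", "S", "L", None),
    ("C", "N", "H", "4.Jpg"),
    ("S", "S", "H", "5.Jpg"),
    ("S", "S", "H", "6.Jpg"),
    ("S", "S", "H", None),
    ("S", "S", "L", "7.Jpg"),
    ("S", "N", "H", None),
    ("S", "N", "L", "8.Jpg"),
    ("S", "N", "L", None),
]


def image_completeness(rows, **conditions):
    """Completeness of I over rows matching equality conditions, e.g. B="C"."""
    selected = [
        row for row in rows
        if all(row[FIELDS.index(name)] == value for name, value in conditions.items())
    ]
    if not selected:
        return None  # completeness is undefined for an empty selection
    return sum(row[3] is not None for row in selected) / len(selected)


overall = image_completeness(ROWS)                    # 7 of 11 rows have an image
canon = image_completeness(ROWS, B="C")               # 3 of 4 Canon rows
sony_normal = image_completeness(ROWS, B="S", M="N")  # 1 of 3 Sony Normal rows
```

Even in this toy table each of the three attributes can be left unconstrained or fixed to one of two values, giving 3^3 = 27 conjunctive equality conditions; the count grows exponentially with the number of attributes.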
User Data Quality Preferences
Preferences:
• Preference as Partial Orders: Multi-criteria decision making
• Preference queries in Database Systems
• Data Quality Aware SQL
• Handling Inconsistency
User Data Quality Preferences
Preferences:
User preference is best modelled as sets of partial orders. [Saaty 1996]
E.g. I prefer Tea over Coffee, then I prefer No Sugar over Sugar.
Or: I prefer Price over Tax, then I prefer Accuracy over Completeness, etc.
[Saaty 1996] T.L. Saaty. Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. RWS Publications, 1996.
I have the following preference matrix about the quality of the Price attribute:
             | Completeness            | Accuracy                    | Consistency
Completeness | -                       | Very much preferred         | Slightly preferred
Accuracy     | Very much less preferred | -                           | A little bit preferred
Consistency  | Slightly less preferred | A little bit less preferred | -
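A pairwise preference matrix of this kind can be turned into one weight per metric. The sketch below uses the row geometric mean, a common approximation of the priority vector in the Analytic Hierarchy Process. The numeric scale (7, 3, 2) assigned to the verbal judgements is an assumption made for illustration and is not taken from the slides.

```python
import math

METRICS = ["Completeness", "Accuracy", "Consistency"]

# Assumed numeric scale for the verbal judgements (not from the slides).
VERY_MUCH, SLIGHTLY, A_LITTLE = 7.0, 3.0, 2.0

# pairwise[i][j] = how strongly METRICS[i] is preferred over METRICS[j];
# the entry below the diagonal is the reciprocal of the one above it.
pairwise = [
    [1.0, VERY_MUCH, SLIGHTLY],
    [1 / VERY_MUCH, 1.0, A_LITTLE],
    [1 / SLIGHTLY, 1 / A_LITTLE, 1.0],
]


def priority_weights(matrix):
    """Normalised geometric mean of each row of a pairwise comparison matrix."""
    geo_means = [math.prod(row) ** (1 / len(row)) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


weights = dict(zip(METRICS, priority_weights(pairwise)))
for metric, weight in weights.items():
    print(f"{metric}: {weight:.3f}")
```

With this assumed scale, Completeness receives the largest weight, followed by Accuracy and then Consistency, matching the direction of the verbal judgements.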
User Data Quality Preferences
Preference queries [Govindarajan 2000]
Model user preference as SQL:
SELECT X WHERE Q PREFER P1, P2
SELECT X WHERE Q PREFER maximum WRT p
Problems:
• Designed for deductive databases, so not suitable for Data Quality Preferences.
• Does not utilize the partial order definition of preference.
[Govindarajan 2000] K. Govindarajan, B. Jayaraman, and S. Mantha. Preference Queries in Deductive Databases. New Generation Computing, 2000
Data Quality Aware SQL
A SQL extension to query any metric in the data quality profile as [Column Name.Metric Name] as part of the SQL query formulation. [Yeganeh 2009]
SELECT Title, Price FROM ShopItem WHERE Title.Completeness>0.8
SELECT Title, Title.Accuracy, Price FROM ShopItem ORDER BY Price.Accuracy
ORDER BY Price.Accuracy models a one-dimensional preference, indicating that sources with higher price accuracy are preferred. A two-dimensional preference cannot be expressed intuitively this way.
[Yeganeh 2009] Yeganeh, N. and Sadiq, S. and Deng, K. and Zhou, X., Data quality aware queries in collaborative information systems, APWeb 2009
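The two queries above can be mimicked over per-source DQ profiles. This is only a sketch of one possible reading, in which the WHERE condition filters sources and ORDER BY ranks them in descending order of the metric. The source names, metric values, and the function dq_select are hypothetical and are not the API of the system described in [Yeganeh 2009].

```python
# Hypothetical DQ profiles: source -> {(column, metric): value}.
PROFILES = {
    "S1": {("Title", "Completeness"): 0.83, ("Price", "Accuracy"): 0.70},
    "S2": {("Title", "Completeness"): 0.95, ("Price", "Accuracy"): 0.90},
    "S3": {("Title", "Completeness"): 0.60, ("Price", "Accuracy"): 0.99},
}


def dq_select(profiles, where=None, order_by=None):
    """Return source names filtered and ranked by DQ metrics.

    where:    (column, metric, threshold), keeps sources with metric > threshold
    order_by: (column, metric), ranks sources by that metric, highest first
    """
    names = list(profiles)
    if where is not None:
        column, metric, threshold = where
        names = [n for n in names if profiles[n][(column, metric)] > threshold]
    if order_by is not None:
        names.sort(key=lambda n: profiles[n][order_by], reverse=True)
    return names


# ... WHERE Title.Completeness>0.8
filtered = dq_select(PROFILES, where=("Title", "Completeness", 0.8))
# ... ORDER BY Price.Accuracy
ranked = dq_select(PROFILES, order_by=("Price", "Accuracy"))
```

Sorting on a single key is exactly the one-dimensional preference noted above; combining several metrics requires an explicit weighting or ordering between them.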
Data Quality Aware SQL
Prioritized preferences (utilize the preference as partial order concept):
E.g. from the sources with highest data quality, sources with high currency of price are prioritized over sources with high completeness of price.
Hierarchy Clause
SELECT Title AS t, Price AS p, [User Comments] AS u
FROM ShopItem WHERE ...
HIERARCHY(ShopItem)
p OVER (t,u) 7, u OVER (t) 3
HIERARCHY(ShopItem.p)
p.Currency OVER (p.Completeness) 3
Generally: HIERARCHY(a) a.x OVER (a.x',...) n
Why Hierarchy?
Intuitively, humans define preferences as partial orders (pairs).
E.g. I prefer coffee to tea.
Data Quality Aware SQL
Preferences as partial orders can be inconsistent
For example: I prefer tea to coffee, I prefer coffee to milk, I prefer milk to tea.
Visual feedback to help the user define consistent preferences:
• The size of a circle represents the weight of the item.
• Color represents consistency of preferences (e.g. darker color means possible inconsistency).
• Automatically fix inconsistencies when possible.
[Yeganeh 2010] Yeganeh, N.K. and Sadiq, S., Avoiding Inconsistency in User Preferences for Data Quality Aware Queries, BIS 2010
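One simple way to detect the kind of inconsistency in the tea, coffee, milk example is to treat each pairwise preference as a directed edge and look for a cycle. The sketch below uses a depth-first search for this; it is an illustration only and is not the method of [Yeganeh 2010]. The function name find_preference_cycle is hypothetical.

```python
def find_preference_cycle(preferences):
    """preferences: iterable of (preferred, less_preferred) pairs.

    Returns a list of items forming a cycle (first item repeated at the end),
    or None if the preferences are consistent.
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == ON_PATH:
                # Reached an item already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


inconsistent = [("tea", "coffee"), ("coffee", "milk"), ("milk", "tea")]
consistent = [("tea", "coffee"), ("coffee", "milk")]
cycle = find_preference_cycle(inconsistent)
no_cycle = find_preference_cycle(consistent)
```

A reported cycle could then be highlighted to the user, or broken automatically by dropping one of its pairs, in the spirit of the feedback described above.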
DQ Aware Query Planning
Select the query plan that maximizes the quality of query results.
• How to estimate the quality of each data source for complex queries, i.e. joins, aggregates, etc.?
• Consideration of the quality of service metrics of each source becomes necessary in addition to Data Quality.
• Data Quality of joins between different data sources is very hard to compute.
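[Figure: a Querying Interface sends "Select * from join A,B,C,D on ..." over the Communication Infrastructure to data sources S1, S2, S3, ..., Sn.]
Possible join plans (one source per relation):
A  | B  | C  | D
S3 | S5 | Sk | Sj
Si | S9 | S4 | Sn
Sx | S1 | Sy | Sb
.. | .. | .. | ..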
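The plan space can be sketched as choosing one source per relation. The example below enumerates every combination and keeps the one with the best score. All quality values are hypothetical, and scoring a plan by its weakest source (the minimum) is a deliberately simplistic placeholder, since estimating the data quality of a join is the hard problem named above.

```python
from itertools import product

# Hypothetical candidate sources per relation with an assumed quality score.
CANDIDATES = {
    "A": {"S3": 0.90, "Si": 0.70},
    "B": {"S5": 0.60, "S9": 0.80},
    "C": {"Sk": 0.95, "S4": 0.85},
    "D": {"Sj": 0.50, "Sn": 0.75},
}


def best_join_plan(candidates):
    """Enumerate one-source-per-relation plans and return (plan, score).

    The score of a plan is the minimum quality among its sources, which is
    an assumed stand-in for a real join quality estimate.
    """
    relations = list(candidates)
    best_plan, best_score = None, float("-inf")
    for sources in product(*(candidates[r] for r in relations)):
        score = min(candidates[r][s] for r, s in zip(relations, sources))
        if score > best_score:
            best_plan, best_score = dict(zip(relations, sources)), score
    return best_plan, best_score


plan, score = best_join_plan(CANDIDATES)
num_plans = len(list(product(*CANDIDATES.values())))  # 2 * 2 * 2 * 2 plans
```

Exhaustive enumeration is only workable for a handful of relations and sources, because the number of plans is the product of the candidate counts per relation.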
Putting all together
DQAQS: Data Quality Aware Query System, a Data Quality Aware Data Integration System
• Data Quality Services (DQS): Services to generate data quality profiles.
• Data Quality Agents (DQA): Workers that manage generation and maintenance of data quality profiles.
• Data Quality Aware Mediator (DQM): A mediator which is able to comprehend the Data Quality aware SQL and orchestrate the query execution (i.e. Data Quality Aware Query Planning).
[Figure: DQAQS architecture showing data sources S1, S2, S3, the DQS and DQA components, and the DQM, connected through a Network / Cloud.]
Questions?