AN INTRODUCTION TO DATA QUALITY SERVICES
Koen Verbeeck, BI consultant
WHO AM I
• BI consultant @ Ordina
• member of SQLUG.be
• MCTS, MCITP in SQL Server 2008
• working with Microsoft BI for over 2 years
• beer and comic books enthusiast
• married with children…
INTRODUCTION
data quality?
• achieved through people, technology & processes
• can be measured with various dimensions
o accuracy
o consistency
o completeness
o duplicates (uniqueness)
o timeliness
o validity
• bad data = bad business
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). - Wikipedia on Data Quality
INTRODUCTION
Data quality issues and sample data problems
• Standard: are data elements consistently defined and understood?
o Gender code = M, F, U in one system and Gender code = 0, 1, 2 in another system
• Complete: is all necessary data present?
o 20% of customers’ last name is blank, 50% of zip-codes are 99999
• Accurate: does the data accurately represent reality or a verifiable source?
o A supplier is listed as ‘Active’ but went out of business six years ago
• Valid: do data values fall within acceptable ranges?
o Temperature recordings should be between -100°C and +100°C
• Unique: data appears several times
o Prince, The Artist formerly known as Prince, The Artist, … are they the same person?
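The checks listed on this slide can be approximated in a few lines of code. The following is a minimal sketch in Python, not the way DQS itself profiles data: the records, the field names and the helper functions `share` and `profile` are all assumptions made up for illustration.

```python
# Illustrative records; field names and values are assumptions for this sketch.
customers = [
    {"last_name": "Doe", "zip": "10001", "gender": "M", "temp_c": 21.5},
    {"last_name": "", "zip": "99999", "gender": "1", "temp_c": 250.0},
    {"last_name": "Smith", "zip": "99999", "gender": "F", "temp_c": -12.0},
    {"last_name": "Jones", "zip": "60601", "gender": "U", "temp_c": 35.0},
]


def share(records, predicate):
    """Return the fraction of records for which predicate is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


def profile(records):
    """Compute a few simple data quality measures for the records."""
    return {
        # Complete: blank last names and placeholder zip codes
        "blank_last_name": share(records, lambda r: not r["last_name"].strip()),
        "placeholder_zip": share(records, lambda r: r["zip"] == "99999"),
        # Standard: gender codes outside the agreed M/F/U code set
        "nonstandard_gender": share(
            records, lambda r: r["gender"] not in {"M", "F", "U"}
        ),
        # Valid: temperature outside the range -100°C to +100°C
        "invalid_temp": share(records, lambda r: not -100 <= r["temp_c"] <= 100),
    }


report = profile(customers)
for measure, value in report.items():
    print(f"{measure}: {value:.0%}")
```

Each measure is a simple ratio, so it can be tracked over time to monitor whether data quality improves.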
INTRODUCTION
• Profiling: analysis of the data source to provide insight into the quality of the data and help to identify data quality issues
• Cleansing: amend, remove or enrich data that is incorrect or incomplete; this includes correction, standardization and enrichment
• Matching: identifying, linking or merging related entries within or across sets of data
• Monitoring: tracking and monitoring the state of quality activities and the quality of data
OUTLINE
• introduction
• overview of data quality services
• building a knowledge base
• data cleansing & matching
• SSIS integration
• conclusion
OVERVIEW OF DQS
Data Quality Services (DQS) is a knowledge-driven data quality solution, enabling IT pros and data stewards to easily improve the quality of their data
• knowledge-driven
• semantics
• knowledge discovery
• open and extendible
• easy to use
OVERVIEW OF DQS
• knowledge-driven: based on a Data Quality Knowledge Base (DQKB)
• semantics: data domains capture the semantics of your data
• knowledge discovery: acquires additional knowledge the more you use it
• open and extendible: supports use of user-generated knowledge and IP by 3rd party reference data providers
• easy to use: compelling user experience designed for increased productivity
OVERVIEW OF DQS
• easy installation
• pre-installation checks
o SQL Server 2012 database engine (server)
o .NET 4.0 & IE 6.0 or higher (client)
• installation of DQS using SQL Server set-up
• post-installation tasks
o run DQSInstaller.exe
o grant DQS roles to users
o enable TCP/IP
BUILDING A KNOWLEDGE BASE
[Diagram: the knowledge base at the center of two activities. Build: knowledge management (discover, explore data, connect, manage knowledge). Use: DQ projects (correct & standardize, match & de-dupe). Also shown: reference data, cloud services, enterprise data, integrated profiling, notifications, progress status.]
BUILDING A KNOWLEDGE BASE
[Diagram: contents of a knowledge base. Domains (represent the data type), values, rules & relations, 3rd party reference data, composite domains, matching policy.]
DEMO: our first knowledge base
BUILDING A KNOWLEDGE BASE
• iterative process
• knowledge discovery
• gather knowledge from
o Excel
o SQL Server
• profiling of data
o not the same as SSIS profiling task!
• automatically detects anomalies
BUILDING A KNOWLEDGE BASE
• domain management
• knowledge about fields is kept in domains
• data steward can
o create rules
o assign synonyms and corrections
o create term-based relations (str. --> street)
o link domains together into composite domains
• import knowledge from
o reference data (e.g. Azure Marketplace)
o other knowledge bases
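To make the three kinds of domain knowledge concrete, here is a minimal sketch in Python. It is only an illustration of the ideas (term-based relations, synonyms, domain rules), not the DQS implementation; the dictionaries, function names and the e-mail pattern are hypothetical.

```python
import re

# Hypothetical domain knowledge; in DQS this would live in a knowledge base.
TERM_RELATIONS = {"str.": "street", "st.": "street", "ave.": "avenue"}
SYNONYMS = {"NYC": "New York", "New York City": "New York"}


def apply_term_relations(value, relations=TERM_RELATIONS):
    """Replace each whitespace-separated term that has a known relation."""
    terms = value.split()
    return " ".join(relations.get(term.lower(), term) for term in terms)


def apply_synonyms(value, synonyms=SYNONYMS):
    """Map a whole value to its leading value if it is a known synonym."""
    return synonyms.get(value, value)


def email_rule(value):
    """Example domain rule: the value must look like name@host.tld."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


print(apply_term_relations("High Str. 13"))
print(apply_synonyms("NYC"))
print(email_rule("john.doe@hotmail"))
```

Term-based relations work on parts of a value, synonyms on the whole value, and rules only accept or reject a value without changing it.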
DATA CLEANSING & MATCHING
• cleansing
• why?
o identifies incomplete or incorrect data
o standardizes and enriches data by using domain values, domain rules and reference data
• DQS cleansing
o create a knowledge base or select an existing one
o create a data quality project
o 2-step process
– computer assisted cleansing
– interactive cleansing
o export results
• St. --> street (corrected)
• Microsot --> Microsoft (corrected)
• john.doe@hotmail (invalid)
• 0472/34672 (invalid)
• Verbeek --> Verbeeck (suggested)
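The statuses in these examples (corrected, invalid, suggested) can be imitated with a small function. This is a rough sketch in Python under stated assumptions: the known values, the corrections table, the `cutoff` of 0.8 and the use of `difflib` for suggestions are illustrative choices, not how DQS computes its results.

```python
from difflib import get_close_matches

# Hypothetical domain knowledge for this sketch.
KNOWN_VALUES = ["Microsoft", "Verbeeck", "Ordina"]
CORRECTIONS = {"Microsot": "Microsoft"}  # confirmed by a data steward


def cleanse(value, known=KNOWN_VALUES, corrections=CORRECTIONS,
            rule=None, cutoff=0.8):
    """Return (output_value, status) for a single value."""
    if value in known:
        return value, "correct"
    if value in corrections:
        return corrections[value], "corrected"
    # A domain rule (any callable returning True/False) can reject the value.
    if rule is not None and not rule(value):
        return value, "invalid"
    # No exact knowledge: look for a sufficiently similar known value.
    candidates = get_close_matches(value, known, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "suggested"
    return value, "new"


for raw in ["Microsoft", "Microsot", "Verbeek", "Contoso"]:
    print(raw, "-->", cleanse(raw))
print(cleanse("0472/34672", rule=str.isalpha))
```

In the interactive cleansing step, a data steward would then approve or reject the suggested and new values, which feeds knowledge back into the knowledge base.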
DATA CLEANSING & MATCHING
• matching
• why?
o identify duplicates within the data source
o create consolidated view of data
• DQS matching
o build a matching policy in KB
o matching training
o create matching project
o choose survivors
DQ Client – Match Results
• Prince
• The Artist Formerly Known As Prince
• The Artist
• Jon Doe, High Street 13, NY, [email protected]
• John Doe, High Str, NY, [email protected]
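The idea behind a matching policy, weighting the similarity of several fields and comparing the total against a threshold, can be sketched as follows. This is a simplified Python illustration: the records, the weights, the 0.8 threshold and the use of `difflib.SequenceMatcher` are assumptions for the example and do not reflect the DQS matching algorithm or its defaults.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records; weights and threshold are assumptions, not DQS defaults.
records = [
    {"id": 1, "name": "Jon Doe", "address": "High Street 13", "city": "NY"},
    {"id": 2, "name": "John Doe", "address": "High Str", "city": "NY"},
    {"id": 3, "name": "Jane Roe", "address": "Main Avenue 5", "city": "LA"},
]
WEIGHTS = {"name": 0.5, "address": 0.3, "city": 0.2}


def similarity(a, b):
    """Case-insensitive similarity of two strings, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1, r2, weights=WEIGHTS):
    """Weighted sum of the per-field similarities of two records."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in weights.items())


def find_matches(recs, threshold=0.8):
    """Return the id pairs whose weighted score reaches the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(recs, 2)
        if match_score(a, b) >= threshold
    ]


print(find_matches(records))
```

Comparing every pair is quadratic in the number of records, so this only suits small samples; cleansing the data first (for example, expanding "Str" to "Street") raises the scores of true duplicates.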
DEMO: cleanse data and use a matching policy to find duplicates
DATA CLEANSING & MATCHING
• create a cleansing project
• uses knowledge gathered in a DQS knowledge base
• simple user-friendly process
• profile results
DATA CLEANSING & MATCHING
• create a matching project
• uses a matching policy created in a knowledge base
• eliminates duplicates
• profile results
• the more knowledge that is added, the better the results will be
o tip: clean up the data first using a cleansing project
• choose survivors at the end
• export results into .csv or SQL Server
SSIS INTEGRATION
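[Diagram: SSIS data flow. Inside an SSIS package, a source + mapping feeds the data correction component, which writes to a destination. The component works against the DQS server, which holds the knowledge base (values/rules, reference data definition) and connects to reference data services. Record outputs shown: new records, corrections & suggestions, correct records, invalid records.]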
DEMO: an SSIS cleansing project
SSIS INTEGRATION
• cleansing as a batch process
• only cleansing; matching is not (yet?) possible
• composite domains are supported
CONCLUSION
• knowledge-driven
o rich knowledge base
o continuous improvement and knowledge acquisition
o build once, reuse for multiple DQ improvements
• easy to use
o focus on productivity and user experience
o designed for business users
o out-of-the-box knowledge
• open & extendible
o focus on cloud-based reference data
o user-generated knowledge
o integration with SSIS
RESOURCES
• DQS Team Blog @ MSDN: http://blogs.msdn.com/b/dqs/
• DQS documentation @ MSDN: http://msdn.microsoft.com/en-us/library/ff877917(v=sql.110).aspx
• SQL Server 2012 Resource Center (nice How-To videos): http://msdn.microsoft.com/en-us/sqlserver/ff898410.aspx
• DQS Forum @ MSDN: http://social.msdn.microsoft.com/Forums/en-US/sqldataqualityservices/threads
• TechEd presentation about DQS by Elad Ziklik: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI207
THE END thanks for watching!