sql 2012 data quality services
DESCRIPTION
Slides from 60 minute screencast for DevelopMentor March 2012TRANSCRIPT
Data Quality ServicesIn SQL Server 2012@LynnLangit
Breakthrough Insights = Better BIPo
wer
View
New
sem
antic
se
rver
mod
el
for S
SAS
Data
Qua
lity
Serv
ices
Mas
ter D
ata
Serv
ices
Colu
mn-
stor
e In
dex (
10x-
100x
fast
er)
Sem
antic
Se
arch
&
File
tabl
e
What is Data Quality Services?
A set of tools and services that allow domain experts to improve Data Quality
• Produces result set with suggested improvements
• Does NOT change source data
Why Use DQS?
• Manually define, match , cleanseSME input
• Programmatically “”, then manually approve• Can ‘learn’
Machine Cleansing
• Can incorporate 3rd party data• Can integrate with other data processes (SSIS)Integration
When to use DQS (scenarios)Issue Detail
Completeness Is all information present?
Conformity Is all data in the correct format?
Consistency Do values represent the same meaning?
Accuracy Do data objects represent their real-world values?
Validity Do data values fall within acceptable ranges?
Duplication Are there multiple copies of the same data?
DQS Architecture
Installing DQS
SQL Server 2012
BI Edition
Enterprise edition
Not installed by defaultClient / Server / SSIS task
Grant 1 of 3 DQS roles on the DQS_Main db
Make your data accessible for SQL operations
Enable TCP/IP for remote DQS
Post Install
Must run ‘DQS Server Installer’ post SQL Install
Do MDS integration
DQS Components on SQL Server 2012
Data Quality Services client interface
How to Use DQS?
List of Basic Steps• Create/Refine/Use a Knowledge Base• Perform a Data Quality Evaluation• Generate output (results)
• List of Components• DQS Server• DQS Client(s)
How to Use DQS? Step 1 - KB
Knowledge Bases• Can use included KB• Can refine included KB• Can create KB from
source data• Can manually create KB
Parts of DQS – Domain Management
Adding Domain Values
• Correct• Error• Invalid
More on Domain Values
• Link as synonyms• Set as leading value
Regular or Composite Domains
More about Domain Management
• Domain Properties – Description, Language…• Reference Data – relate to 3rd party data• Domain Rules – RegEx/length, etc…rule-based• Domain Values – shows substitute values• Term-Based Relations – common word
corrections
Parts of DQS – Knowledge Discovery
Parts of DQS – Knowledge Discovery
Step two – Running Discovery
Parts of DQS – Knowledge Discovery
Step three – Correcting Values
DQS KB – Creating a Matching Policy
• Step One• Select data to be matched for each domain
DQS KB – Creating a Matching Policy• Step Two• Create matching rules per domains
• Similar• set similarity score, when matching score < 60• For numbers, set threshold (% or int)• For dates, set threshold (DD, MM or YY)
• Exact – identical values (score of 100)• Configure Weight, must sum to 100
• Can configure Prerequisites
DQS KB – Creating a Matching Policy• Step Three• Test matching rules per domains
• Click ‘Start’• Review ‘Matching Results’ tabs to compare one or more results
Matching – See ResultsMatching is usually performed AFTER cleansing and is focused on identifying (and removing) duplicates
More Matching Output
Using the DQS KB to do Cleaning
• Create or Open a Data Quality Project• Map the DQS KB to the new data• Perform Cleansing• Manage / View Results• Export corrected results
DQS Project -- Cleansing
DQS Cleaning in Process…
DQS Cleaning complete
DQS Cleaning – Manage Results
DQS Output file Information
Export file column names (with option to include "Data and Cleansing Info“)– XXX_Source - original source column value– XXX_Output - clean column value– XXX_Reason - reason column value was either valid or invalid– XXX_Confidence - column confidence percentage returned
by the DQS server algorithms– XXX_Status - column processing status (i.e. Correct, New,
Invalid, etc.)
DQS Administration - General
DQS Administration – Reference Data
DQS Administration - Logging
DQS Integration
List of Integration Points• API? – not at this time• SSIS task• MDS (Master Data Management)
DQS Cleansing Task in SSIS
DQS Cleansing Task in SSIS - mappingFor each input column define columns for • Source – contains input values• Output – contains correct or corrected
or invalid output values• Status – contains auto suggest, correct,
invalid or new
Running Package Status
DQS SSIS Task
DQS SSIS Task Complex Example
What is Master Data Management?
Define it• Central repository for data• Rule-based• Can work with DQS
Types of Data Quality Projects• Exact matches (WHERE = WHERE <> WHERE IN)• LIKE (%string matching)
T-SQL scripts (boolean match)
• CONTAINSFull-text matching
(semantic word match)
• SEMANTICSIMIALARITIESTABLESemantic Search (semantic phrase match)
• List belowSSIS tasks - (transactional,
multi-valued matching)
• Knowledge Base - rules/matches• Data Quality project - clean / correct dataDQS (KB matching)
• Versioned Entities, Attributes and RulesMDS (One view of truth)
New in RC0• Use knowledge import from projects back to your knowledge base
(KB) with Cleanse2KB• Use the Office speller as part of the DQS client• Use Composite Domain rules
– to correct values– to detect rules violations
• Import values from Excel – import values together with their synonyms
• Use unstructured composite domain values? – KB parsing is a new feature that takes advantage of your knowledge for a
more accurate parsing• Modify server log settings through the client UI
Performance Information
Resources
DQS Team Blog - hereDQS video – hereDQS on TechNet - hereMore samples – here
www.Develop.com
www.TeachingKidsProgramming.org
Do a Recipe Teach a Kid (Ages 10 ++)Microsoft SmallBasic Free Courseware (recipes)
Keep up with Data
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me• To help build your BI/Big Data solution• To teach your team next gen BI with
SQL Server 2012