Scalable Data Quality

Page 1: Scalable Data Quality

NoCOUG Presentation 11-13-2003

Using Metadata to Drive Data Quality – Hunting the Data Dust Bunnies

John Murphy

Apex Solutions, Inc.

NoCOUG

11-13-2003

Page 2: Scalable Data Quality

Presentation Outline

1. The Cost - It’s always funny when it’s someone else…

2. Quality - Quality Principles and The Knowledge Worker

3. Data Quality
4. Data Development
5. Metadata Management
6. Profile and Baseline Statistics
7. Vendor Tools
8. Wrap-up
9. Some light reading

Page 3: Scalable Data Quality

The Cost

1. The Cost…

Page 4: Scalable Data Quality

The Cost…

“Quality is free. What’s expensive is finding out how to do it right the first time.” – Philip Crosby

• A major credit service was ordered to pay $25M for releasing hundreds of one bank’s customer names to another bank because a confidentiality indicator was not set.

• The stock value of a major health care insurer dropped 40% because analysts reported the insurer was unable to determine which “members” were active, paying policy holders. The stock lost $3.7 billion in value in 72 hours.

• The sale / merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with an 80% discrepancy.

Page 5: Scalable Data Quality

The Cost…

• The US Attorney General determined that 14% of all health care dollars, approximately $23 billion, were the result of fraud or inaccurate billing.

• The DMA estimates that more than 20% of the customer information housed in its members’ databases is inaccurate or unusable.

• A major Telco has over 350 CUSTERMER tables, with CUSTOMER data repeated as many as 40 times and 22 separate systems capable of generating or modifying customer data.

• A major communications company could not invoice 2.6% of its customers because the addresses provided were non-deliverable. Total cost > $85M annually.

• A State Government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses.

• A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales, based on data entry validation at the source of order entry.

Page 6: Scalable Data Quality

The Cost

[Chart: TDWI 2002 survey data.]

Page 7: Scalable Data Quality

The Regulatory Challenges

No more “Corporate” data

• New Privacy Regulations
– Direct Marketing Access
– Telemarketing Access
– Opt-In / Opt-Out

• New Customer Managed Data Regulations
– HIPAA

• New Security Regulations
– Insurance Black List Validation
– Bank Transfer Validation

• Business Management
– Certification of Financial Statements
– Sarbanes-Oxley

These have teeth…

Page 8: Scalable Data Quality

Sources of Data Non-Quality

[Chart: TDWI 2002 survey data.]

Page 9: Scalable Data Quality

Data Development and Data Quality

2. Data Quality

Page 10: Scalable Data Quality

Data Quality Process

There is a formal process to quantitatively evaluate the quality of corporate data assets.

The process outlined here is based on Larry English’s Total Quality data Management (TQdM):
• Audit the current data resources
• Assess the quality
• Measure non-quality costs
• Re-engineer and cleanse data assets
• Update the data quality process

Page 11: Scalable Data Quality

Determination of Data Quality

[Chart: TDWI 2002 survey data.]

Page 12: Scalable Data Quality

Data Quality Process

1. Develop a Data Quality Process to quantitatively evaluate the quality of corporate data assets.
2. Establish the Metadata Repository.
3. Implement the Data Development and Standardization Process.
4. Profile and baseline your data.
5. Use the metadata to improve your data quality.
6. Revise the Data Quality Process.

Page 13: Scalable Data Quality

Determination of Data Quality

[Chart: TDWI 2002 survey data.]

Page 14: Scalable Data Quality

Customer Satisfaction Profile

Determine who the most consistent users of specific data entities are. Select a sample set of attributes to be reviewed.

Publish the metadata report for the selected attributes.

Select representatives from the various business areas and knowledge workers to review the selected attributes and metadata.

Distribute the questionnaires, retrieve and score the results.

Report on and distribute the results.

Page 15: Scalable Data Quality

Data Quality Assessments

Assessment form header: Attribute Name, Attribute Mnemonic, Version, Status, Date.

Each item below is rated Above Expectation, Meets Expectation, Below Expectation, Unusable or Not Used:

• Business Names are Clear and Understandable
• Data Definitions conform to standards
• Data Domain Values are correct and complete
• Data Mnemonics are consistent and Understandable
• List of valid codes are complete and correct
• The business rules are correct and complete
• The data has value to the business
• The Refresh Frequency is correct
• The Data Steward is correct
• The Example data is correct

Page 16: Scalable Data Quality

Quality Assessment Results

[Bar chart: average scores (0–3 scale) for the ten assessment criteria from the previous slide, plotted against the acceptability threshold. Scores of 2.7, 2.5, 1.2, 2.7, 1.3, 1.3, 2.8, 2.8, 2.6 and 1.6 appear across the criteria; items falling below the threshold are flagged as problem metadata.]

Page 17: Scalable Data Quality

Quality Assessment Results

[Bar chart: the same ten criteria on a later assessment, with scores of 2.7, 2.5, 2.8, 2.1, 2.6, 2.4, 2.8, 2.8, 2.6 and 2.2 against the acceptability threshold. Improving…]

Page 18: Scalable Data Quality

Quality Assessment Results

[Bar chart: the same ten criteria on a final assessment, with scores of 2.7, 2.9, 2.8, 2.8, 2.9, 2.7, 2.8, 2.8, 2.6 and 3.0, all at or above the acceptability threshold. Got it right!]

Page 19: Scalable Data Quality

Data Development and Standardization

4. Building Data

Page 20: Scalable Data Quality

Data Standardization Process

[Diagram: Data Development and Approval Process. Data requirements are assembled by the data architect into a proposal package, which passes through a technical review (data administrator) and a functional review (data stewards and architect). Issues raised are logged and resolved; resolved issues and approved elements feed the integrated data model (data elements) and are recorded in the MDR.]

Page 21: Scalable Data Quality

Data Standardization Process

Proposal Package – Data Model, Descriptive Information, Organization Information, Integration Information, Tool Specific Information

Technical Review – Model Compliance, Metadata Complete and accurate

Functional Review – Integration with Enterprise Model

Issue Resolution – Maintenance and management

Total process < 30 days. All based on an integrated, web-accessible application. Results are integrated into the Enterprise Metadata Repository.

Page 22: Scalable Data Quality

Data Standardization

Getting to a single view of the truth.

Getting to a corporate-owned process of data and information management.

Describe Existing Data Assets

Addition of new business needs

Page 23: Scalable Data Quality

Data Development Process

There is a formal process for the development, certification, modification and retirement of data.
• Data Requirements lead directly to Physical Data Structures.
• Data Products lead directly to Information Products.

The Data Standards Evaluation Guide
• Enterprise level, not subject area specific
– I can use “customer” throughout the organization
– I can use “Quarterly Earnings” throughout the organization
• All data objects have a common minimal set of attributes dependent upon their type.
– All data elements have a name, business name, data type, length, size or precision, collection of domain values, etc.
• There are clear, unambiguous examples of the data use.
• The data standards are followed by all development and management teams.
• The same data standards are used to evaluate internally derived data as well as vendor-acquired data.

Page 24: Scalable Data Quality

Data Standardization Process

Standardization is the basis of modeling – Why model?
• Find out what you are doing so you can do it better
• Discover data
• Identify sharing partners for processes and data
• Build a framework for databases that support the business
• Establish data stewards
• Identify and eliminate redundant processes and data

Check out ICAM / IDEF…

Page 25: Scalable Data Quality

An Example Process Model

[IDEF0 diagram: activity A0, “Conduct Procurement.” Inputs: requirement package, communication from contractor, industry resource data. Controls: funding; statutes, regulations and policies; acquisition guidance. Outputs: purchase, solicitation announcement, notification to vendor, purchase performance analysis, proposed programs and procurement issues. Mechanisms: company support team, purchase officer.]

Page 26: Scalable Data Quality

Zachman Framework Process

The framework is a matrix: rows are perspectives, columns are Data (what), Function (how), Network (where), People (who), Time (when) and Motivation (why).

Objectives / Scope: list of things important to the enterprise; list of processes the enterprise performs; list of locations where the enterprise operates; list of organizational units; list of business events / cycles; list of business goals / strategies.

Model of the Business: entity relationship diagram (including m:m, n-ary, attributed relationships); business process model (physical data flow diagram); logistics network (nodes and links); organization chart with roles, skill sets and security issues; business master schedule; business plan.

Model of the Information System: data model (converged entities, fully normalized); essential data flow diagram and application architecture; distributed system architecture; human interface architecture (roles, data, access); dependency diagram and entity life history (process structure); business rule model.

Technology Model: data architecture (tables and columns) with map to legacy data; system design (structure chart, pseudo-code); system architecture (hardware, software types); user interface (how the system will behave) and security design; "control flow" diagram (control structure); business rule design.

Detailed Representation: data design (denormalized) and physical storage design; detailed program design; network architecture; screens and security architecture (who can see what?); timing definitions; rule specification in program logic.

Functioning System (working systems): converted data; executable programs; communications facilities; trained people; business events; enforced rules.

Page 27: Scalable Data Quality

The Data Model…

Data Model – A description of the organization of data in a manner that reflects the information structure of an enterprise.

Logical Data Model – The user perspective of enterprise information, independent of the target database or database management system.

Entity – A person, place, thing or concept.
Attribute – Detailed descriptive information associated with an entity.
Relationship – A business rule applied to one or more entities.
Element – A named identifier for each of the entities and their attributes that are to be represented in a database.

Page 28: Scalable Data Quality

Rules to Entities and Attributes

There is more than one state. Each state may contain multiple cities. Each city is always associated with a state. Each city has a population. Each city may maintain multiple roads. Each road has a repair status. Each state has a motto. Each state adopts a state bird. Each state bird has a color.

Resulting elements for the STATE subject area:
• STATE Code
• CITY Name
• CITY POPULATION Quantity
• CITY ROAD Name
• CITY ROAD REPAIR Status
• STATE MOTTO Text
• STATE BIRD Name
• STATE BIRD COLOR Name

Page 29: Scalable Data Quality

Resulting Populated Tables (3NF)

CITY ROAD
State Code | City Name   | City Road Name | City Road Repair Status
VA         | Alexandria  | Route 1        | 2
VA         | Alexandria  | Franconia      | 1
MD         | Annapolis   | Franklin       | 1
MD         | Baltimore   | Broadway       | 3
AZ         | Tucson      | Houghton       | 2
AZ         | Tucson      | Broadway       | 2
IL         | Springfield | Main           | 3
MA         | Springfield | Concord        | 1

STATE
State Code | State Motto | State Bird Name
VA         | " "         | Cardinal
MD         | " "         | Oriole
AZ         | " "         | Cactus Wren
IL         | " "         | Cardinal
MA         | " "         | Chickadee

CITY
State Code | City Name   | City Pop.
VA         | Alexandria  | 200K
MD         | Annapolis   | 50K
MD         | Baltimore   | 1500K
AZ         | Tucson      | 200K
IL         | Springfield | 40K
MA         | Springfield | 45K

STATE BIRD
State Bird Name | State Bird Color
Cardinal        | Red
Oriole          | Black
Cactus Wren     | Brown
Chickadee       | Brown
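A DDL sketch of the four tables above (hypothetical column names and sizes; the keys follow the model on the next slide):

CREATE TABLE state_bird (
  state_bird_name       VARCHAR2(30) PRIMARY KEY,
  state_bird_color_name VARCHAR2(20)
);

CREATE TABLE state (
  state_code       CHAR(2) PRIMARY KEY,
  state_motto_text VARCHAR2(100),
  state_bird_name  VARCHAR2(30) REFERENCES state_bird
);

CREATE TABLE city (
  state_code   CHAR(2) REFERENCES state,
  city_name    VARCHAR2(40),
  city_pop_qty NUMBER,
  PRIMARY KEY (state_code, city_name)
);

CREATE TABLE city_road (
  state_code              CHAR(2),
  city_name               VARCHAR2(40),
  city_road_name          VARCHAR2(40),
  city_road_repair_status NUMBER(1),
  PRIMARY KEY (state_code, city_name, city_road_name),
  FOREIGN KEY (state_code, city_name) REFERENCES city  -- a road belongs to exactly one city
);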

Page 30: Scalable Data Quality

Example Entity, Attributes, Relationships

State Model

STATE: STATE Code, STATE MOTTO Text, STATE BIRD Name (FK)
CITY: STATE Code (FK), CITY Name, CITY POPULATION Quantity
CITY ROAD: STATE Code (FK), CITY Name (FK), CITY ROAD Name, CITY ROAD REPAIR Status
STATE BIRD: STATE BIRD Name, STATE BIRD COLOR Name

Relationships: STATE contains CITY; CITY maintains CITY ROAD; STATE BIRD is adopted by / adopts STATE; STATE BIRD becomes road kill on / kills CITY ROAD.

Page 31: Scalable Data Quality

Data Standardization

Data Element Standardization -The process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions.

Standard Data Element Structure:
• Prime Word – required, exactly one
• Property Modifier(s) – 0 to n
• Class Word Modifier(s) – 0 to n
• Class Word – maps to a Generic Element

For example, in “Person Eye Color Code” (two slides ahead), “Person” is the prime word, “Eye” a property modifier, “Color” a class word modifier and “Code” the class word.

Page 32: Scalable Data Quality

The Generic Element

The Generic Element - The part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements. Generic elements may have specific or general domains.

Examples – Code, Amount, Weight, Identifier

Domains – The range of values associated with an element. General domains can be infinite ranges, as with an ID number, or fixed, as with a state code.

Page 33: Scalable Data Quality

Standardized Data Element

EXAMPLE

Element Name: Person Eye Color Code
Access Name: PR-EY-CLR-CD
Definition Text: The code that represents the natural pigmentation of a person’s iris
Authority Reference Text: U.S. Code title 10, chapter 55
Steward Name: USD (P&R)

Domain values:
BK . . . Black
BL . . . Blue
BR . . . Brown
GR . . . Green
GY . . . Gray
HZ . . . Hazel
VI . . . Violet
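A sketch of how these domain values might be enforced physically (hypothetical table and column names; a lookup table keeps the repository the system of record for the domain):

CREATE TABLE eye_color_code (
  eye_color_code CHAR(2)      PRIMARY KEY,  -- BK, BL, BR, GR, GY, HZ, VI
  eye_color_name VARCHAR2(10) NOT NULL      -- Black, Blue, Brown, ...
);

ALTER TABLE person
  ADD CONSTRAINT person_eye_color_fk
  FOREIGN KEY (pr_ey_clr_cd) REFERENCES eye_color_code;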

Page 34: Scalable Data Quality

Standards

Name Standards
• Comply with format
• Single concept; clear, accurate and self-explanatory
• According to functional requirements, not physical considerations
• Upper and lower case alphabetic characters, hyphens (-) and spaces ( ) only (a simple dictionary check is sketched after these lists)
• No abbreviations or acronyms, conjunctions, plurals, articles, verbs, or class words used as modifiers or prime words

Definition Standards
• What the data is, not HOW, WHERE or WHEN it is used, or WHO uses it
• Add meaning to the name
• One interpretation; no multiple-purpose phrases, unfamiliar technical jargon, abbreviations or acronyms
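A minimal check for the character rule above, assuming a hypothetical metadata_elements repository table. TRANSLATE strips every allowed character; anything left over marks a non-compliant name:

-- '!' is a dummy character outside the allowed set; letters, hyphens
-- and spaces are removed, so only disallowed characters remain.
SELECT element_name
FROM   metadata_elements
WHERE  TRANSLATE(element_name,
         '!ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz- ',
         '!') IS NOT NULL;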

Page 35: Scalable Data Quality

Integration of the Data Through Metadata

[Diagram: Data Integration by Subject Area – three overlapping subject areas (1, 2, 3).]

Page 36: Scalable Data Quality

Data Model Integration

Brings together (joins) two or more approved Data Model views

Adds to the scope and usability of the Corporate Data Model (EDM)

Continues to support the activities of the department that the individual models were intended to support

Enables the sharing of information between the functional areas or components which the Data Models support

Page 37: Scalable Data Quality

Enterprise Data Model

Use of the Enterprise Data Model

[Diagram: functional models and views (e.g. SECURITY-CLEARANCE, ORGANIZATION-SECURITY-CLEARANCE, PERSON-SECURITY-CLEARANCE, ORGANIZATION) are integrated into standard metadata and schemas, which in turn drive component models, component views and system models across the ORGANIZATION. A SECURITY-LEVEL variant (ORGANIZATION-SECURITY-LEVEL, PERSON-SECURITY-LEVEL) illustrates reconciling synonymous structures into the shared model.]

Page 38: Scalable Data Quality

Metadata Management

5. Metadata Management

Page 39: Scalable Data Quality

Data in Context!

Mr. End User Sets Context For His Data.

Page 40: Scalable Data Quality

Metadata

Metadata is the data about data… Huh?

Metadata is the descriptive information used to set the context and limits around a specific piece of data.

• The metadata lets data become discrete and understandable by all communities that come in contact with a data element.

• Metadata is the intersection of certain facts about data that lets the data become unique.

• It makes data unique, understood and unambiguous.

• The accumulation of metadata creates a piece of data: the more characteristics about the data you have, the more unique and discrete the data can be.

Page 41: Scalable Data Quality

Relevant Metadata

• Technical - Information on the physical warehouse and data.

• Operational / Business – Rules on the data and content

• Administrative – Security, group identification, etc.

• The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community who comes in contact with the data element. The meta model components make the data element unique to each community and sub community.

Page 42: Scalable Data Quality

Acquiring the Metadata

Data Modeling Tools – API and extract to repository
Reverse-Engineered RDBMS – Export extract
ETL Tools – Data mapping, source-to-target mapping
Scheduling Tools – Refresh rates and schedules
Business Intelligence Tools – Retrieval use
Current Data Dictionary

Page 43: Scalable Data Quality

Technical Metadata

Physical Descriptive Qualities
• Standardized Name
• Mnemonic
• Data Type
• Length
• Precision
• Data Definition
• Unit of Measure
• Associated Domain Values
• Transformation Rules
• Derivation Rule
• Primary and Alternate Source
• Entity Association
• Security and Stability Control

Page 44: Scalable Data Quality

Administrative and Operational Metadata

Relates the business perspective to the end user and manages content:
• Retention period
• Update frequency
• Primary and optional sources
• Steward for element
• Associated process model
• Modification history
• Associated requirement document
• Business relations
• Aggregation rules
• Subject-area oriented, to ensure understanding by the end user

Page 45: Scalable Data Quality

The Simple Metamodel

[Metamodel diagram – entities: Subject Area, Entity, Entity Alias, Attribute, Attribute Alias, Attribute Default, Encoding / Lookup Tables, Relationship, Source System, Individual, Individual View.]

Page 46: Scalable Data Quality

The Common Meta Model

[Metamodel diagram, based on Tannenbaum – entities: Subject Area, Business Term, Business Term Synonym, Business Term Abbreviation, Abbreviation, Library, Data Model, Sub Model, Model Entity, Sub Model Entity, Model Attribute, Attribute, Relationship, Data Element, DBMS Instance, DBMS Attribute, Server, Database, Datastore Constraint, Repository.]

Page 47: Scalable Data Quality

The Common Warehouse Metamodel

Page 48: Scalable Data Quality

Required Data Element Technical Metadata

• Name
• Mnemonic
• Definition
• Data value source list text
• Decimal place count quantity
• Authority reference text
• Domain definition text
• Domain value identifiers
• Domain value definition text
• High and low range identifiers
• Maximum character count quantity
• Proposed attribute functional data steward
• Functional area identification code
• Unit measure name
• Data type name
• Security classification code
• Creation date

A repository table sketch for these fields follows.
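A sketch of a repository table carrying the required fields above (hypothetical names and sizes; domain value identifiers and definitions would normally live in a child table, one row per allowed value):

CREATE TABLE data_element (
  element_name             VARCHAR2(100) PRIMARY KEY,
  mnemonic                 VARCHAR2(30) NOT NULL,
  definition_text          VARCHAR2(2000),
  data_value_source_text   VARCHAR2(500),
  decimal_place_count_qty  NUMBER(2),
  authority_reference_text VARCHAR2(500),
  domain_definition_text   VARCHAR2(2000),
  low_range_id             VARCHAR2(30),
  high_range_id            VARCHAR2(30),
  max_character_count_qty  NUMBER(5),
  proposed_data_steward    VARCHAR2(100),
  functional_area_id_code  VARCHAR2(10),
  unit_measure_name        VARCHAR2(30),
  data_type_name           VARCHAR2(30),
  security_class_code      VARCHAR2(10),
  creation_date            DATE DEFAULT SYSDATE NOT NULL
);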

Page 49: Scalable Data Quality

Use of the Enterprise Tools

[Diagram: the Enterprise Data Dictionary System (EDDS) holds prime words, data elements and metadata; the Enterprise Data Model (EDM) holds entities, attributes and relationships (business rules). These feed migration and new information systems, whose database tables, columns and rows are documented in the database dictionary along with associations and table joins. Everything is housed in the Enterprise Data Repository (EDR).]

Page 50: Scalable Data Quality

Profile the Data

6. Profile and Baseline Data

Page 51: Scalable Data Quality

Audit Data

Establish Table Statistics
• Total size in bytes, including indexes
• When was it last refreshed
• Is referential integrity applied

Establish Row Statistics
• How many rows
• How many columns per table (a query for both is sketched after this list)

Establish Column Statistics
• How many unique values
• How many null values
• How many values outside the defined domain
• If a key value, how many duplicates
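For example, rows and columns per table can come straight from the dictionary (a sketch; num_rows is populated only after statistics have been gathered, as on the following slides):

SELECT t.table_name,
       t.num_rows,
       COUNT(c.column_name) AS column_count   -- columns per table
FROM   user_tables t, user_tab_columns c
WHERE  c.table_name = t.table_name
GROUP  BY t.table_name, t.num_rows;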

Page 52: Scalable Data Quality

Some Simple Statistics

In Oracle, run ANALYZE against tables, partitions, indexes and clusters. It lets you set the sample size as a specific percentage of the total size or as a specific number of rows. The default is a 1064-row sample.

Example – 5.7 million rows in the Transaction table:

analyze table transaction estimate statistics;
• Statistics are estimated using a 1064-row sample.

analyze table transaction estimate statistics sample 20 percent;
• Statistics are estimated using 1.14 million rows.

Statistics are stored in several views:

View                      Column Name    Contents
user_tables               num_rows       total rows when analyzed
user_indexes              distinct_keys  the number of distinct values in the indexed column
user_part_col_statistics  num_distinct   number of distinct values in the column
user_tab_col_statistics   num_distinct   number of distinct values in the column
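On later Oracle releases, the same sampling is done through the supported DBMS_STATS package rather than ANALYZE (a sketch; the schema and table names are illustrative):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,          -- current schema
    tabname          => 'TRANSACTION',
    estimate_percent => 20,            -- like ESTIMATE ... SAMPLE 20 PERCENT
    cascade          => TRUE);         -- gather index statistics as well
END;
/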

Page 53: Scalable Data Quality

Getting Statistics

Get the Statistics…

SQL> select table_name, num_rows
     from user_tables
     where num_rows is not null;

TABLE_NAME     NUM_ROWS
-----------   ---------
Transaction     5790230
Account         1290211
Product             308
Location           2187
Vendors            4203

Alternatively, you can use select count by table:

SQL> select count(*) from transaction;

  COUNT(*)
----------
   5790230

Page 54: Scalable Data Quality

Getting Statistics

To determine unique counts of a column in a table:

SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE];

To determine the number of NULL values in a column in a table (COUNT of a column ignores NULLs, so count the rows instead):

SQL> SELECT COUNT(*) FROM [TABLE] WHERE [COLUMN] IS NULL;

To determine if there are values outside a domain range:

SQL> SELECT COUNT(*) FROM [TABLE]
     WHERE [COLUMN] NOT IN ('val1', 'val2', 'val3');
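The remaining column statistic from the audit list, duplicate key values, follows the same placeholder pattern (a sketch):

SQL> SELECT [COLUMN], COUNT(*)
     FROM [TABLE]
     GROUP BY [COLUMN]
     HAVING COUNT(*) > 1;   -- key values appearing more than once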

Page 55: Scalable Data Quality

Getting Usage Statistics

What tables are being used? With auditing on, audit data is loaded to DBA_AUDIT_OBJECT.

• Create a table with columns for object_name, owner and hits.
• Insert the data from DBA_AUDIT_OBJECT into your new table.
• Clear out the data in DBA_AUDIT_OBJECT.
• Write the following report:

col obj_name form a30
col owner form a20
col hits form 99,990

select obj_name, owner, hits from aud_summary;

OBJ_NAME      OWNER           HITS
-----------   --------  ----------
Region        Finance        1,929
Transaction   Sales     18,916,344
Account       Sales      4,918,201
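The steps above might look like this in SQL (a sketch; aud_summary is the working table named in the report, and purging the audit trail via SYS.AUD$ requires DBA privileges):

CREATE TABLE aud_summary (
  obj_name VARCHAR2(30),
  owner    VARCHAR2(30),
  hits     NUMBER
);

INSERT INTO aud_summary (obj_name, owner, hits)
  SELECT obj_name, owner, COUNT(*)
  FROM   dba_audit_object
  GROUP  BY obj_name, owner;

COMMIT;

DELETE FROM sys.aud$;   -- clears DBA_AUDIT_OBJECT for the next period
COMMIT;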

Page 56: Scalable Data Quality

Baseline Statistics

Based on the statistics collected:
• Use these as a baseline and save them to the metadata repository as operational metadata
• Compare with planned statistics generated with knowledge workers
• Generate and publish reports covering the data
• Use these as baseline statistics

Regenerate the statistics on a fixed-period basis:
• Compare and track over time

Page 57: Scalable Data Quality

Quality Assessment

[Three bar charts: Data Quality Assessment for Q1, Q2 and Q3. Each plots actual vs. target percentages (0.00%–3.00%) for the key quality metrics: null values, domain deviation, duplicate keys, incomplete rows, invalid street addresses, invalid URLs and duplicate accounts.]

• Establish baseline KPIs for data quality
• Perform statistics on sample sets
• Compare results

The gap between quarters is the time to develop and implement corrective measures and push process change upstream.
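Each bar above is just defective rows over total rows. A sketch of two such KPIs against a hypothetical account table with a status column:

SELECT 100 * SUM(CASE WHEN status IS NULL THEN 1 ELSE 0 END) / COUNT(*)
           AS null_value_pct,
       100 * SUM(CASE WHEN status NOT IN ('A', 'I', 'P') THEN 1 ELSE 0 END) / COUNT(*)
           AS domain_deviation_pct   -- 'A','I','P' stand in for the defined domain
FROM   account;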

Page 58: Scalable Data Quality

Total Error Tracking over Time

Set an error reduction schedule.
Track errors over time.
Note when new systems or impacting processes are added.

[Line chart: % total errors by month (Jan–Aug), actual error rate % vs. planned error rate %, on a 0–6% scale.]

Page 59: Scalable Data Quality

Performance Assessment

Customer Reporting Analysis

[Line chart: performance statistics for the monthly DW load, plotted from 2-02 through 11-04 on a 0–10 scale.]

Page 60: Scalable Data Quality

Daily Performance Statistics

[Two stacked charts of FSYS MIPS use by hour (0–23), 0–1300 MIPS, broken out by workload: BATCH-IM, BATCH-PROD, BATCH-TEST, DDF-STP&DIST, DDF-OTHER, DDF-MIDSU, DDF-MEMMS, DDF-INTCARE, DDF-EPRO, DDF-ITELL, DDF-DPROP, DDF-BRIO, DB2-PROD, TSO, OVERHEAD UNCAPT, XPTR.]

Daily Performance Cycles (Friday, Sept 27) – Normal daily load: daily on-demand reporting and batch reporting.

EOM Performance Cycles (Wednesday, Oct 2) – End-of-month daily performance cycle: EOM loads impacting normal daily reporting, with additional CPU / swap / cache / DB overhead.

Page 61: Scalable Data Quality

Monitoring Your Data

1. Source Statistics – Data distribution, row counts, etc.
2. Schedule Exceptions – Load abends, unavailable source or target
3. System Statistics – Configuration stats, system performance logs
4. Change Control – Model / process change history
5. Test Criteria and Scenarios – Scripts, data statistics, test performance
6. Meta Model – Metadata (domain values, operational metadata, etc.)
7. Load Statistics – Value distribution, row counts, load rates
8. Test / Production Data Statistics – Data distribution, row counts, model revisions, refresh history
9. Query Performance – End user query performance and statistics
10. End User Access – Who is accessing, what query / service is requested, when they access, what business they are associated with
11. Web Logs – Monitor external user access and performance
12. End User Feedback – Comments, complaints and whines

Page 62: Scalable Data Quality

Monitoring Your Data

[Diagram: a typical analytical environment with twelve numbered monitor points matching the list on the previous slide. Operational source data is extracted (via an ETL tool or extract query) into a staging area, then loaded to the data warehouse and reporting / analytical data marts on production servers; development datamart and ETL development servers sit alongside, with fiber channel connections to development and production areas of the storage array. A mid-tier environment (web server, document broker, portal security) serves internal analysts and remote end users over the intranet / WWW. Monitor points: source statistics (1), schedule exceptions (2), system statistics (3), change control history (4), test criteria and results (5), the meta model (6) in the metadata and modeling repository, load statistics (7), test / production data statistics (8), query performance statistics (9), end user access statistics (10), web logs (11) and end user feedback (12).]

Page 63: Scalable Data Quality

A Reason for Metadata

[Diagram: metadata spans the full delivery chain – Source Systems → Staging / ODS → Data Warehouse → Data Marts / Analytics → End User Access → Information Distribution.]

Page 64: Scalable Data Quality

Metadata and Monitoring

The metadata provides an objective, criteria-based evaluation of the data from a quality / integrity standpoint.

The Metadata provides standards for data use and quality assurance at all levels from the enterprise to the individual.

The Metadata ensures continuity in the data independently from the applications and users accessing it. Applications come and go but data is forever…

Metadata forces us to understand the data that we are using prior to its use.

Metadata promotes corporate development and retention of data assets.

Page 65: Scalable Data Quality

The Leaky Pipe…

Existing processes and systems leak value everywhere:
• Increased processing costs
• Inability to relate customer data
• Poor exception management
• Lost confidence in analytical systems
• Inability to react to time-to-market pressures
• Unclear definitions of the business
…all draining into decreased profits.

• Gets worse every day
• Must plug the holes NOW
• Easy ROI justification

Page 66: Scalable Data Quality

Vendor Tools

7. Vendor Tools and Metadata

Page 67: Scalable Data Quality

Vendor Metadata

CASE Tools – ERwin, Designer 2000, PowerDesigner
• Technical metadata

RDBMS – Oracle, Informix, DB2
• Technical metadata
• Operational statistics – row counts, domain value deviation, utilization rates, security

ETL
• Transformation mappings
• Exception management
• Recency

BI
• Utilization

ERP
• Source of record

Page 68: Scalable Data Quality

Current Metadata Management

[Diagram: one-way metadata flows from each tool class into the metadata repository serving the knowledge worker.]

Business Intelligence: Brio, Cognos, Business Objects, Oracle Discoverer, MicroStrategy, Hyperion
CASE: ERwin, BPwin, Designer 2000, Rational Rose, PowerDesigner
RDBMS: Oracle, DB2/MF, DB2/UDB, MS SQL Server, Teradata, Informix, Sybase
ERP: PeopleSoft, SAP, Oracle Apps
ETL / EAI: Informatica, Ardent, Tibco

• Reflects data after the fact
• Most are only current-state views, no history
• No data development and standardization process
• No standards for definitions

Page 69: Scalable Data Quality

Bi-Directional Metadata Management

[Diagram: the same tool landscape as the previous slide (BI, CASE, RDBMS, ERP, ETL / EAI and the knowledge worker), now with bi-directional metadata flows between every tool class and the metadata repository.]

Page 70: Scalable Data Quality

Vendor Strengths in Data Quality

Page 71: Scalable Data Quality

Wrap-up

8. Wrap-up

Page 72: Scalable Data Quality

Wrap-up

Use metadata as part of your data quality effort.
• Incomplete metadata is a pay-me-now or pay-me-later proposition.

Develop statistics around the data distribution, refresh strategy, access, etc.
• Know what your data looks like. Know when it changes.

Use your metadata to answer the who, what, when, where and why about your data.
• Tie your Data Quality Management (DQM) to your Total Quality Management (TQM) to create a TQdM program.

Understand the data distribution in the production environment.
• Understand the statistics about your data.

Publish statistics to a common repository.
• Share your data quality standards and reports about the statistics.

Page 73: Scalable Data Quality

Summary - Implement

Implement Validation Routines at data collection points.

Implement ETL and Data Quality Tools to automate the continuous detection, cleansing, and monitoring of key files and data flows.

Implement Data Quality Checks. Implement data quality checks or audits at reception points or within ETL processes. Stringent checks should be done at source systems and a data integration hub.

Consolidate Data Collection Points to minimize divergent data entry practices.

Consolidate Shared Data. Use a data warehouse or ODS to physically consolidate data used by multiple applications.

Minimize System Interfaces by (1) backfilling a data warehouse behind multiple independent data marts, (2) merging multiple operational systems or data warehouses, (3) consolidating multiple non-integrated legacy systems by implementing packaged enterprise application software, and/or (4) implementing a data integration hub (see next).

Page 74: Scalable Data Quality

Summary - Implement

Implement a Data Integration Hub which can minimize system interfaces and provide a single source of clean, integrated data for multiple applications. This hub uses a variety of middleware (e.g. message queues, object request brokers) and transformation processes (ETL, data quality audits) to prepare and distribute data for use by multiple applications.

Implement a Meta Data Repository. Create a repository for managing meta data gleaned from all enterprise systems. The repository should provide a single place for systems analysts and business users to look up definitions of data elements, reports, and business views; trace the lineage of data elements from source to targets; identify data owners and custodians; and examine data quality reports. In addition, enterprise applications, such as a data integration hub or ETL tools, can use this meta data to determine how to clean, transform, or process data in its workflow.

Page 75: Scalable Data Quality

Some Light Reading…

Metadata Solutions by Adrienne Tannenbaum

Improving Data Warehousing and Business Information Quality by Larry English

The DoD 8320-M standard for data creation and management

Data Warehousing and the Zachman Framework by W.H. Inmon, John Zachman and John Geiger

Common Warehouse Metamodel (CWM) Specification

Page 76: Scalable Data Quality

Working with complete attributes…

A vital piece of previously omitted metadata adversely impacts the outcome of the game…

Page 77: Scalable Data Quality


John Murphy – [email protected]

Suzanne Riddell – [email protected]

Page 78: Scalable Data Quality

Touch Points Impact

[Diagram: operational systems, the data warehouse, data marts and repositories all add, update and retrieve the same data. Same data, multiple locations, multiple touch points.]

Page 79: Scalable Data Quality

Quality Assessment Content

Project Information
• Identifier, Name, Manager, Start Date, End Date

Project Metrics
• Reused Data Object Count
• New Data Object Count
• Objects Modified

[Bar chart: project attributes (new, reused, redundant and updated object counts) for Projects A through D, on a 0–100 scale.]

Page 80: Scalable Data Quality

Metadata Strategy

1. Build Data Quality Process
• Establish Data Quality Steering Committee
• Establish Data Stewards
• Establish Metadata Management Process
• Establish Data Development and Certification Process

2. Audit existing metadata resources
• Data models
• ETL applications
• RDBMS schemas
• Collect and certify existing metadata

3. Develop Meta Model
• Determine key metadata sources and alternate sources

4. Develop Metadata Repository and Access Strategy
• Implement meta model
• Populate with available as-is metadata

5. Define gaps in the metadata

Page 81: Scalable Data Quality

Using Metadata for Quality

1. Develop the Data Quality Process
2. Implement the Data Development and Standardization Process
3. Establish the Metadata Repository
4. Profile and baseline your data
5. Use the metadata to improve your data quality
6. Revise the Data Quality Process

Page 82: Scalable Data Quality

Statistical Analysis

Determine your sample size:
• Size needs to be statistically significant
• If in doubt, use a true random 1% (a SQL sketch follows this list)
• Repeat the complete process several times to gain confidence and repeatability
• Example:
– N = ((Confidence Level x Est. Standard Deviation) / Bound)^2
– N = ((2.575 x 0.330) / 0.11)^2
– N = 60 rows
• Use as large a meaningful sample set as possible.
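In Oracle, the "true random 1%" can come straight from the SAMPLE clause (a sketch against the transaction table used earlier; rerun several times, since stable results across samples indicate repeatability):

SELECT COUNT(*)
FROM   transaction SAMPLE (1);   -- random ~1% of the table's rows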

Page 83: Scalable Data Quality

What Causes Data Warehouses to Fail

1. Failing to understand the purpose of data warehousing
2. Failing to understand who the real “customers” of the data warehouse are
3. Assuming the source data is “OK” because the operational systems seem to work just fine
4. Not developing an enterprise-focused information architecture, even if only developing a departmental data mart
5. Focusing on performance over information quality in data warehousing
6. Not solving the information quality problems at the source
7. Inappropriate “ownership” of data correction / cleanup processes
8. Not developing effective audit and control processes for the data Extract, Correct, Transform and Load (ECTL) processes
9. Misuse of information quality software in the data warehousing processes
10. Failing to exploit this opportunity to “correct” some of the wrongs created by the previous 40 years of bad habits

Page 84: Scalable Data Quality

Metadata Tool Vendors

Data Advantage – www.dataadvantage.com
CA Platinum – www.ca.com
Arkidata – www.arkidata.com
Sagent – www.sagent.com
Dataflux – www.dataflux.com
DataMentors – www.datamentors.com
Vality – www.vality.com
Evoke – www.evokesoft.com

Page 85: Scalable Data Quality

Data and Information Quality

2. Quality…

Page 86: Scalable Data Quality

Quality – What it is and is not

Data and information quality is the ability to consistently meet the customer’s expectations and to adapt to those expectations as they change.

Quality is a process, not an end point.

Quality is understanding the impact of change and the ability to proactively adapt.

Quality is building adaptable / survivable processes – the less I have to change while keeping my knowledge workers’ satisfaction high, the more successful I’ll be.

Data and information quality is not data cleansing or transformations. By then it’s too late.

Quality impacts the costs associated with scrap and rework – just like manufacturing!

Page 87: Scalable Data Quality

The Quality Leaders

The Quality Leaders
• W. Edwards Deming – 14 Points of Quality, moving from do it fast to do it right
• Philip Crosby – 14 Step Quality Program – determine what is to be delivered, then the timeline
• Malcolm Baldrige – Determination of excellence, commitment to change
• Masaaki Imai – Kaizen, continuous process improvement

Quality Frameworks
• Six Sigma – A statistically repeatable approach
• Lean Thinking – Simplify to eliminate waste
• ISO 9000 – Quality measurement process

Page 88: Scalable Data Quality

Quality Tools

Six Sigma – A statistically repeatable approach
• Define – Once a project has been selected by management, the team identifies the problem, defines the requirements and sets an improvement goal.
• Measure – Used to validate the problem, refine the goal, then establish a baseline to track results.
• Analyze – Identify the potential root causes and validate a hypothesis for corrective action.
• Improve – Develop solutions to root causes, test the solutions and measure the impact of the corrective action.
• Control – Establish standard methods and correct problems as needed. In other words, the corrective action should become the new requirement, but additional problems may occur that will have to be adjusted for.

Page 89: Scalable Data Quality

Quality Principles – The Knowledge Worker

IT has a reason to exist: the knowledge worker.

At Toyota the knowledge worker is the “honored guest.”

It’s all for the knowledge worker. How well do you know them?
• Who are your knowledge workers?
• What data do they need?
• When do they use your data?
• Where do they access it from?
• Why do they need it to do their job?

Do your KWs feel like honored guests or cows in the pasture?

Building a profile of the knowledge workers:
• Classes of knowledge workers – farmers, explorers, inventors
• Determine the distribution of the knowledge workers
• Determine their use profile

Page 90: Scalable Data Quality

User Groups by Data Retrieval Needs

[Pie chart: user groups by data retrieval needs.]
• Grazers – 80% – push reporting
• Explorers – 15% – push with drill-down
• Inventors – 5% – any, all and then some

Page 91: Scalable Data Quality

Quality Shared – IT and Users

Shared ownership of the data
• What data do I have?
• How do I care for it?
• What do I want to do with it?
• Where do I / my process add value?

Start with a target
• Build the car while you’re driving
• Everyone owns the process
• Everyone participates
• Break down the barriers

Page 92: Scalable Data Quality

The Barriers to Quality

Knowledge Workers’ gripes about IT
• IT can’t figure out how to get my data on time – I’ll do it in Access
• IT has multiple calculations for the same values – I’ll correct them by hand
• It takes IT forever to build that table for me – I’ll do it in Excel

IT gripes about Knowledge Workers
• KWs won’t make the time to give us an answer
• What KWs said last month isn’t the same as this month
• They are unrealistic in their expectations
• We can’t decide that, it’s their decision
• I don’t think they can understand a data model

Page 93: Scalable Data Quality

Quality Tools

Lean Thinking – Simplify to eliminate waste
• Value – Defining what the customer wants. Any characteristic of the product or service that doesn’t align with the customer’s perception of value is an opportunity to streamline.
• Value Stream – The value stream is the vehicle for delivering value to the customer. It is the entire chain of processes that develop, produce and deliver the desired outcome. Lean enterprise tries to streamline the process at every step of the way.
• Flow – Sequencing the value stream (process flow) in such a manner as to eliminate any part of the process that doesn’t add value.
• Pull – The concept of producing only what is needed, when it’s needed. This avoids stockpiling products by producing or providing only what the customer wants, when they want it.
• Perfection – The commitment to continually pursue the ideal: creating value while eliminating waste.

Page 94: Scalable Data Quality

Total Quality data Management

TQdM is a standard data quality process from Larry English.

Process 1 – Assess the Data Definition and Information Architecture Quality
• In – Starting point
• Out – Technical data definition quality
• Out – Information groups
• Out – Information architecture
• Out – Customer satisfaction

Process 2 – Assess the Information Quality
• In – Technical data definition quality assessment
• Out – Information value and cost chain
• Out – Information quality reports

Process 3 – Measure Non-Quality Information Costs
• In – Outputs from Process 2
• Out – Information value and cost analysis

Page 95: Scalable Data Quality

Total Quality data Management

Process 4 – Re-engineer Data and Data Clean-up
• In – Outputs from Process 3
• Out – Data defect identification
• Out – Cleansed data to data warehouse and marts

Process 5 – Improve Information Process Quality
• In – Production data, raw and clean
• Out – Identified opportunities for quality improvement

Process 6 – Establish Information Quality Environment
• In – All quality issues from Processes 1 to 5
• Out – Management of Processes 1 to 5

This collects much of the metadata already in existence.

Page 96: Scalable Data Quality

Information Quality Improvement Process

[Process flow diagram: P1 Assess Data Definition and Information Architecture → P2 Assess Information Quality → P3 Measure Non-Quality Information Costs → P4 Re-Engineer and Cleanse Data → P5 Improve Information Process Quality, all operating within P6 Establish Information Quality Environment.]