NoCOUG Presentation 11-13-2003
Using Metadata to Drive Data Quality: Hunting the Data Dust Bunnies
John Murphy
Apex Solutions, Inc.
NoCOUG
11-13-2003
Presentation Outline
1. The Cost – It’s always funny when it’s someone else…
2. Quality – Quality Principles and the Knowledge Worker
3. Data Quality
4. Data Development
5. Metadata Management
6. Profile and Baseline Statistics
7. Vendor Tools
8. Wrap-up
9. Some light reading
The Cost
1. The Cost…
The Cost…
“Quality is free. What’s expensive is finding out how to do it right the first time.” – Philip Crosby
• A major credit service was ordered to pay $25M for releasing the names of hundreds of one bank’s customers to another bank because a confidentiality indicator was not set.
• The stock value of a major health care insurer dropped 40% because analysts reported the insurer was unable to determine which “members” were active, paying policyholders. The stock lost $3.7 billion in value in 72 hours.
• The sale/merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with an 80% discrepancy.
The Cost…
• The US Attorney General determined that 14% of all health care dollars, approximately $23 billion, were the result of fraud or inaccurate billing.
• The DMA estimates that more than 20% of the customer information housed in its members’ databases is inaccurate or unusable.
• A major telco has over 350 CUSTOMER tables, with customer data repeated as many as 40 times and 22 separate systems capable of generating or modifying customer data.
• A major communications company could not invoice 2.6% of its customers because the addresses provided were non-deliverable. Total cost > $85M annually.
• A state government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses.
• A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales, based on data entry validation at the point of order entry.
The Cost

(Chart: TDWI, 2002)
The Regulatory Challenges
No more “corporate” data
• New privacy regulations
  – Direct marketing access
  – Telemarketing access
  – Opt-in / Opt-out
• New customer-managed data regulations
  – HIPAA
• New security regulations
  – Insurance black list validation
  – Bank transfer validation
• Business management
  – Certification of financial statements
  – Sarbanes-Oxley
These have teeth…
Sources of Data Non-Quality

(Chart: TDWI, 2002)
Data Development and Data Quality
2. Data Quality
Data Quality Process
There is a formal process to quantitatively evaluate the quality of corporate data assets.
The process outlined here is based on Larry English’s Total data Quality Management (TdQM):
• Audit the current data resources
• Assess the quality
• Measure non-quality costs
• Reengineer and cleanse data assets
• Update the data quality process
Determination of Data Quality

(Chart: TDWI, 2002)
Data Quality Process
1. Develop a data quality process to quantitatively evaluate the quality of corporate data assets.
2. Establish the metadata repository.
3. Implement a data development and standardization process.
4. Profile and baseline your data.
5. Use the metadata to improve your data quality.
6. Revise the data quality process.
Determination of Data Quality

(Chart: TDWI, 2002)
Customer Satisfaction Profile
• Determine who the most consistent users of specific data entities are.
• Select a sample set of attributes to be reviewed.
• Publish the metadata report for the selected attributes.
• Select representatives from the various business areas and knowledge workers to review the selected attributes and metadata.
• Distribute the questionnaires, then retrieve and score the results.
• Report on and distribute the results.
Data Quality Assessments
Assessment form (per attribute):

Attribute Name: ____  Attribute Mnemonic: ____  Version: ____  Status Date: ____

Rating scale: Above Expectation / Meets Expectation / Below Expectation / Unusable / Not Used

Criteria:
• Business names are clear and understandable
• Data definitions conform to standards
• Data domain values are correct and complete
• Data mnemonics are consistent and understandable
• List of valid codes are complete and correct
• The business rules are correct and complete
• The data has value to the business
• The refresh frequency is correct
• The data steward is correct
• The example data is correct
Quality Assessment Results

(Bar chart: average scores for the ten assessment criteria on a 0–3 scale against an acceptability threshold; several criteria fall well below the threshold and are flagged as problem metadata.)
Quality Assessment Results

(Bar chart: the same ten criteria re-scored; most now sit near or above the acceptability threshold. Improving…)
Quality Assessment Results

(Bar chart: all ten criteria now score at or above the acceptability threshold. Got it right!)
Data Development and Standardization
4. Building Data
Data Standardization Process
(Diagram: the data development and approval process. Data requirements and a proposal package flow from the stewards and the data architect through technical review and functional review; issues are routed to issue resolution by the data administrator, and approved elements land in the integrated data model (data elements) and the MDR.)
Data Standardization Process
• Proposal Package – data model, descriptive information, organization information, integration information, tool-specific information
• Technical Review – model compliance; metadata complete and accurate
• Functional Review – integration with the enterprise model
• Issue Resolution – maintenance and management

Total process < 30 days. All based on an integrated, web-accessible application. Results are integrated into the Enterprise Metadata Repository.
Data Standardization
• Getting to a single view of the truth
• Getting to a corporate-owned process of data and information management
• Describe existing data assets
• Add new business needs
Data Development Process
There is a formal process for development, certification, modification and retirement of data.
• Data requirements lead directly to physical data structures.
• Data products lead directly to information products.

The Data Standards Evaluation Guide:
• Enterprise level, not subject-area specific
  – I can use “customer” throughout the organization
  – I can use “Quarterly Earnings” throughout the organization
• All data objects have a common minimal set of attributes dependent upon their type.
  – All data elements have a name, business name, data type, length, size or precision, collection of domain values, etc.
• There are clear, unambiguous examples of the data’s use.
• The data standards are followed by all development and management teams.
• The same data standards are used to evaluate internally derived data as well as vendor-acquired data.
Data Standardization Process
Standardization is the basis of modeling – why model?
• Find out what you are doing so you can do it better
• Discover data
• Identify sharing partners for processes and data
• Build a framework for databases that support the business
• Establish data stewards
• Identify and eliminate redundant processes and data
Check out ICAM / IDEF…
An Example Process Model

(IDEF0 diagram: the A0 activity “Conduct Procurement”, with elements including funding; statutes, regulations & policies; acquisition guidance; requirement package; proposed programs & procurement issues; industry resource data; communication from contractor; company support team; purchase officer; purchase; solicitation announcement; notification to vendor; and purchase performance analysis.)
Zachman Framework Process

Objectives / Scope: list of things important to the enterprise; list of processes the enterprise performs; list of locations where the enterprise operates; list of organizational units; list of business events / cycles; list of business goals / strategies.

Model of the Business: entity relationship diagram (including m:m, n-ary, attributed relationships); business process model (physical data flow diagram); logistics network (nodes and links); organization chart, with roles, skill sets, security issues; business master schedule; business plan.

Model of the Information System: data model (converged entities, fully normalized); essential data flow diagram, application architecture; distributed system architecture; human interface architecture (roles, data, access); dependency diagram, entity life history (process structure); business rule model.

Technology Model: data architecture (tables and columns), map to legacy data; system design (structure chart, pseudo-code); system architecture (hardware, software types); user interface (how the system will behave), security design; “control flow” diagram (control structure); business rule design.

Detailed Representation: data design (denormalized), physical storage design; detailed program design; network architecture; screens, security architecture (who can see what?); timing definitions; rule specification in program logic.

Functioning System: converted data; executable programs; communications facilities; trained people; business events; enforced rules.

(Each row lists the Data, Function, Network, People, Time, and Motivation cells in order.)
The Data Model…
Data Model – A description of the organization of data in a manner that reflects the information structure of an enterprise.

Logical Data Model – The user perspective of enterprise information, independent of the target database or database management system.

Entity – A person, place, thing or concept.
Attribute – Detail descriptive information associated with an entity.
Relation – A business rule applied to one or more entities.
Element – A named identifier of each of the entities and their attributes that are to be represented in a database.
Rules to Entities and Attributes
Business rules: There is more than one state. Each state may contain multiple cities. Each city is always associated with a state. Each city has a population. Each city may maintain multiple roads. Each road has a repair status. Each state has a motto. Each state adopts a state bird. Each state bird has a color.

STATE model attributes:
• STATE Code
• CITY Name
• CITY POPULATION Quantity
• CITY ROAD Name
• CITY ROAD REPAIR Status
• STATE MOTTO Text
• STATE BIRD Name
• STATE BIRD COLOR Name
Resulting Populated Tables (3NF)

CITY ROAD (State Code, City Name, City Road Name, City Road Repair Status):
VA  Alexandria   Route 1    2
VA  Alexandria   Franconia  1
MD  Annapolis    Franklin   1
MD  Baltimore    Broadway   3
AZ  Tucson       Houghton   2
AZ  Tucson       Broadway   2
IL  Springfield  Main       3
MA  Springfield  Concord    1

STATE (State Code, State Motto, State Bird):
VA  “ ”  Cardinal
MD  “ ”  Oriole
AZ  “ ”  Cactus Wren
IL  “ ”  Cardinal
MA  “ ”  Chickadee

STATE CITY (State Code, City Name, City Pop.):
VA  Alexandria   200K
MD  Annapolis    50K
MD  Baltimore    1500K
AZ  Tucson       200K
IL  Springfield  40K
MA  Springfield  45K

STATE BIRD (State Bird Name, State Bird Color):
Cardinal     Red
Oriole       Black
Cactus Wren  Brown
Chickadee    Brown
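A minimal DDL sketch of the 3NF structures above (Oracle syntax; the table and column names and data types are illustrative assumptions, not taken from the deck):

```sql
-- Illustrative Oracle DDL for the state model; names and types are assumptions.
CREATE TABLE state_bird (
  state_bird_name   VARCHAR2(30) PRIMARY KEY,
  state_bird_color  VARCHAR2(20)
);

CREATE TABLE state (
  state_code       CHAR(2) PRIMARY KEY,
  state_motto      VARCHAR2(100),
  state_bird_name  VARCHAR2(30) REFERENCES state_bird
);

CREATE TABLE state_city (
  state_code                CHAR(2) REFERENCES state,
  city_name                 VARCHAR2(40),
  city_population_quantity  NUMBER,
  PRIMARY KEY (state_code, city_name)
);

CREATE TABLE city_road (
  state_code               CHAR(2),
  city_name                VARCHAR2(40),
  city_road_name           VARCHAR2(40),
  city_road_repair_status  NUMBER(1),
  PRIMARY KEY (state_code, city_name, city_road_name),
  FOREIGN KEY (state_code, city_name) REFERENCES state_city
);
```

Note how each “Each X may contain multiple Y” rule becomes a foreign key from the child back to its parent’s primary key.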
Example Entity, Attributes, Relationships

(ER diagram of the state model: STATE (STATE Code, STATE MOTTO Text, STATE BIRD Name (FK)) contains STATE CITY (STATE Code (FK), CITY Name, CITY POPULATION Quantity), which maintains CITY ROAD (STATE Code (FK), CITY Name (FK), CITY ROAD Name, CITY ROAD REPAIR Status). STATE BIRD (STATE BIRD Name, STATE BIRD COLOR Name) is adopted by / adopts STATE, and “becomes road kill on / kills” links it to CITY ROAD.)
Data Standardization
Data Element Standardization -The process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions.
Standard Data Element Structure (diagram):
• Prime Word (required, 1)
• Property Modifier(s) (0–n)
• Class Word (required, 1)
• Class Word Modifier(s) (0–n)
• Generic Element
The Generic Element
The Generic Element - The part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements. Generic elements may have specific or general domains.
Examples – Code, Amount, Weight, Identifier
Domains – The range of values associated with an element. Domains can be infinite, as with an ID number, or fixed, as with a state code.
Standardized Data Element
EXAMPLE

Element Name: Person Eye Color Code
Access Name: PR-EY-CLR-CD
Definition Text: The code that represents the natural pigmentation of a person’s iris
Domain values:
  BK . . . Black
  BL . . . Blue
  BR . . . Brown
  GR . . . Green
  GY . . . Gray
  HZ . . . Hazel
  VI . . . Violet
Authority Reference Text: U.S. Code title 10, chapter 55
Steward Name: USD (P&R)
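Once a domain like this lives in the repository, it can also be enforced in the database itself; a sketch, assuming a hypothetical PERSON table carrying the standardized column name:

```sql
-- Hypothetical enforcement of the PR-EY-CLR-CD domain in Oracle;
-- the PERSON table and pr_ey_clr_cd column are illustrative.
ALTER TABLE person
  ADD CONSTRAINT ck_pr_ey_clr_cd
  CHECK (pr_ey_clr_cd IN ('BK','BL','BR','GR','GY','HZ','VI'));
```

Keeping the constraint’s value list generated from the repository’s domain values prevents the database and the metadata from drifting apart.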
Standards
Name Standards
• Comply with format
• Single concept; clear, accurate and self-explanatory
• According to functional requirements, not physical considerations
• Upper- and lower-case alphabetic characters, hyphens (-) and spaces ( )
• No abbreviations or acronyms, conjunctions, plurals, articles, verbs, or class words used as modifiers or prime words

Definition Standards
• What the data is, not HOW, WHERE or WHEN it is used, or WHO uses it
• Add meaning to the name
• One interpretation; no multiple-purpose phrases, unfamiliar technical terms, abbreviations or acronyms
Integration of the Data Through Metadata

(Diagram: data integration by subject area – Subject Area 1, Subject Area 2, Subject Area 3.)
Data Model Integration
• Brings together (joins) two or more approved data model views
• Adds to the scope and usability of the corporate Enterprise Data Model (EDM)
• Continues to support the activities of the departments that the individual models were intended to support
• Enables the sharing of information between the functional areas or components that the data models support
Enterprise Data Model

Use of the Enterprise Data Model

(Diagram: standardized ORGANIZATION metadata & schemas flow from the enterprise model into system models, component views and models, and functional views and models; each repeats the same standardized entities – SECURITY-CLEARANCE, ORGANIZATION-SECURITY-CLEARANCE, ORGANIZATION, PERSON-SECURITY-CLEARANCE – with local variants such as SECURITY-LEVEL, ORGANIZATION-SECURITY-LEVEL, and PERSON-SECURITY-LEVEL.)
Metadata Management
5. Metadata Management
Data in Context!

Mr. End User sets the context for his data.
Metadata
Metadata is the data about data… Huh?

Metadata is the descriptive information used to set the context and limits around a specific piece of data.
• Metadata lets data become discrete and understandable by all communities that come in contact with a data element.
• Metadata is the intersection of certain facts about data that lets the data become unique.
• It makes data unique, understood, and unambiguous.
• The accumulation of metadata creates a piece of data. The more characteristics about the data you have, the more unique and discrete the data can be.
Relevant Metadata
• Technical – Information on the physical warehouse and data
• Operational / Business – Rules on the data and content
• Administrative – Security, group identification, etc.

The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community that comes in contact with the data element. The meta model components make the data element unique to each community and sub-community.
Acquiring the Metadata
• Data modeling tools – API and extract to repository
• Reverse-engineered RDBMS – export extract
• ETL tools – data mapping, source-to-target mapping
• Scheduling tools – refresh rates and schedules
• Business intelligence tools – retrieval use
• Current data dictionary
Technical Metadata
Physical descriptive qualities:
• Standardized name
• Mnemonic
• Data type
• Length
• Precision
• Data definition
• Unit of measure
• Associated domain values
• Transformation rules
• Derivation rule
• Primary and alternate source
• Entity association
• Security and stability control
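In Oracle, much of this physical metadata can be pulled straight from the data dictionary; a sketch (the TRANSACTION table name is illustrative, and the definition column is populated only where column comments have been set):

```sql
-- Pull name, data type, length and precision from Oracle's dictionary,
-- joining column comments as a stand-in for the data definition.
SELECT c.column_name,
       c.data_type,
       c.data_length,
       c.data_precision,
       cc.comments AS data_definition
FROM   user_tab_columns c
LEFT OUTER JOIN user_col_comments cc
       ON  cc.table_name  = c.table_name
       AND cc.column_name = c.column_name
WHERE  c.table_name = 'TRANSACTION'
ORDER  BY c.column_id;
```

A load of this result into the metadata repository gives the technical baseline against which later standardization reviews can be scored.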
Administrative and Operational Metadata
Relates the business perspective to the end user and manages content:
• Retention period
• Update frequency
• Primary and optional sources
• Steward for element
• Associated process model
• Modification history
• Associated requirement document
• Business relations
• Aggregation rules
• Subject-area oriented to ensure understanding by the end user
The Simple Metamodel

(Diagram: a simple metamodel relating Entity, Entity Alias, Individual View, Attribute, Encoding / Lookup Tables, Attribute Alias, Attribute Default, Relationship, Source System, Subject Area, and Individual.)
The Common Meta Model

(Diagram, based on Tannenbaum: a common meta model relating Subject Area, Attribute, Business Term, Business Term Synonym, Business Term Abbreviation, Library, Data Model, Sub Model, Sub Model Entity, Model Entity, Model Attribute, Relationship, DBMS Attribute, DBMS Instance, Server, Database, Datastore Constraint, Data Element, and Repository.)
The Common Warehouse Metamodel
Required Data Element Technical Metadata
• Name
• Mnemonic
• Definition
• Data value source list text
• Decimal place count quantity
• Authority reference text
• Domain definition text
• Domain value identifiers
• Domain value definition text
• High and low range identifiers
• Maximum character count quantity
• Proposed attribute functional data steward
• Functional area identification code
• Unit measure name
• Data type name
• Security classification code
• Creation date
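A sketch of how these required fields might be held as one row per standardized element in a repository table (the table name, column names, and sizes are assumptions, not the deck’s schema):

```sql
-- Hypothetical repository table: one row per standardized data element.
CREATE TABLE data_element_metadata (
  element_name                  VARCHAR2(100) PRIMARY KEY,
  mnemonic                      VARCHAR2(30),
  definition_text               VARCHAR2(2000),
  data_value_source_list_text   VARCHAR2(2000),
  decimal_place_count           NUMBER(2),
  authority_reference_text      VARCHAR2(500),
  domain_definition_text        VARCHAR2(2000),
  high_range_identifier         VARCHAR2(50),
  low_range_identifier          VARCHAR2(50),
  max_character_count           NUMBER(5),
  functional_data_steward       VARCHAR2(100),
  functional_area_id_code       VARCHAR2(10),
  unit_measure_name             VARCHAR2(30),
  data_type_name                VARCHAR2(30),
  security_classification_code  VARCHAR2(10),
  creation_date                 DATE DEFAULT SYSDATE
);
```

Domain value identifiers and their definitions would normally go in a child table keyed on element_name, one row per value.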
Use of the Enterprise Tools

(Diagram: the Enterprise Data Dictionary System (EDDS) holds prime words, data elements, and metadata; the Enterprise Data Model (EDM) holds entities, attributes, and relationships (business rules). Together they feed migration/new information systems, whose database tables, columns, and rows, database dictionary, and associations and table joins are held in the Enterprise Data Repository (EDR).)
Profile the Data
6. Profile and Baseline Data
Audit Data
Establish table statistics:
• Total size in bytes, including indexes
• When was it last refreshed?
• Is referential integrity applied?

Establish row statistics:
• How many rows?
• How many columns per table?

Establish column statistics:
• How many unique values?
• How many null values?
• How many values outside the defined domain?
• If a key value, how many duplicates?
Some Simple Statistics
In Oracle, run ANALYZE against tables, partitions, indexes and clusters. You can set the sample size as a specific percentage of the total size or as a specific number of rows; the default is a 1064-row sample.

Example – 5.7 million rows in the TRANSACTION table:

“analyze table transaction estimate statistics;”
• Statistics are estimated using a 1064-row sample.

“analyze table transaction estimate statistics sample 20 percent;”
• Statistics are estimated using 1.14 million rows.
Statistics are stored in several views:

View                      Column Name    Contents
user_tables               num_rows       total rows when analyzed
user_indexes              distinct_keys  the number of distinct values in the indexed column
user_part_col_statistics  num_distinct   number of distinct values in the column
user_tab_col_statistics   num_distinct   number of distinct values in the column
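ANALYZE still works, but Oracle’s documented replacement for gathering optimizer statistics is the DBMS_STATS package; a sketch of the equivalent 20 percent estimate on the same table:

```sql
-- Gather a 20% statistics sample on TRANSACTION using DBMS_STATS,
-- which populates the same user_tables / user_tab_col_statistics views.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'TRANSACTION',
    estimate_percent => 20);
END;
/
```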
Getting Statistics
Get the Statistics…
SQL> select table_name, num_rows from user_tables where num_rows is not null;

TABLE_NAME    NUM_ROWS
-----------  ---------
Transaction    5790230
Account        1290211
Product            308
Location          2187
Vendors           4203

Alternatively, you can count rows per table directly:

SQL> select count(*) from transaction;

  COUNT(*)
----------
   5790230
Getting Statistics
To determine unique counts of a column in a table:

SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE];

To determine the number of NULL values in a column in a table:

SQL> SELECT COUNT(*) FROM [TABLE] WHERE [COLUMN] IS NULL;

To determine if there are values outside a domain range:

SQL> SELECT COUNT(*) FROM [TABLE] WHERE [COLUMN] NOT IN ('val1','val2','val3');
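These checks can also be combined into a single pass over the table; a sketch against a hypothetical STATUS column of the TRANSACTION table (the column and the domain values are illustrative):

```sql
-- One-pass column profile: row count, nulls, distinct values,
-- and values outside the defined domain.
-- NULLs are counted separately and are not included in out_of_domain_rows.
SELECT COUNT(*)                                 AS total_rows,
       COUNT(*) - COUNT(status)                 AS null_rows,
       COUNT(DISTINCT status)                   AS distinct_values,
       SUM(CASE WHEN status NOT IN ('val1','val2','val3')
                THEN 1 ELSE 0 END)              AS out_of_domain_rows
FROM   transaction;
```

A single scan matters on a 5.7-million-row table: three separate queries read the table three times.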
Getting Usage Statistics
What tables are being used? With auditing on, audit data is loaded to DBA_AUDIT_OBJECT.

• Create a table with columns for object_name, owner and hits.
• Insert the data from DBA_AUDIT_OBJECT into your new table.
• Clear out the data in DBA_AUDIT_OBJECT.
• Write the following report:

col obj_name form a30
col owner form a20
col hits form 99,990

select obj_name, owner, hits from aud_summary;

OBJ_NAME      OWNER        HITS
-----------  --------  ----------
Region       Finance        1,929
Transaction  Sales     18,916,344
Account      Sales      4,918,201
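The summary steps above might look like this (AUD_SUMMARY is the deck’s table; the column sizes are assumptions, and note that DBA_AUDIT_OBJECT is a view over SYS.AUD$, which is where the audit trail is actually purged):

```sql
-- Build and load the usage summary table from the audit trail.
CREATE TABLE aud_summary (
  obj_name VARCHAR2(128),
  owner    VARCHAR2(128),
  hits     NUMBER
);

INSERT INTO aud_summary (obj_name, owner, hits)
SELECT obj_name, owner, COUNT(*)
FROM   dba_audit_object
GROUP  BY obj_name, owner;

COMMIT;

-- Clear the collected audit data (requires appropriate privileges;
-- the underlying audit trail table is SYS.AUD$).
DELETE FROM sys.aud$;
COMMIT;
```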
Baseline Statistics
Based on the statistics collected:
• Use these as a baseline and save them to the metadata repository as operational metadata.
• Compare with planned statistics generated with knowledge workers.
• Generate and publish reports covering the data.

Regenerate the statistics on a fixed-period basis:
• Compare and track over time.
Quality Assessment

(Charts: Data Quality Assessment for Q1, Q2 and Q3 – the percentage of errors for each key quality metric (null values, domain deviation, duplicate keys, incomplete rows, invalid street addresses, invalid URLs, duplicate accounts) plotted against a target on a 0–3% scale.)

• Establish baseline KPIs for data quality.
• Perform statistics on sample sets.
• Compare results.
• Allow time to develop and implement corrective measures and push process change upstream.
Total Error Tracking over Time
• Set an error reduction schedule.
• Track errors over time.
• Note when new systems or impacting processes are added.

(Chart: % total errors by month, January through August – actual error rate % vs. planned error rate %.)
Performance Assessment

(Chart: customer reporting analysis – performance statistics for the monthly DW load, February 2002 through November 2004.)
Daily Performance Statistics

(Charts: FSYS MIPS use by hour for a normal day, Friday, Sept 27, and an end-of-month day, Wednesday, Oct 2, broken out by workload: BATCH-IM, BATCH-PROD, BATCH-TEST, DDF-STP&DIST, DDF-OTHER, DDF-MIDSU, DDF-MEMMS, DDF-INTCARE, DDF-EPRO, DDF-ITELL, DDF-DPROP, DDF-BRIO, DB2-PROD, TSO, uncaptured overhead, and XPTR.)

Normal daily load:
• Daily on-demand reporting
• Batch reporting

End-of-month daily performance cycle:
• EOM loads impacting normal daily reporting
• Additional CPU / swap / cache / DB overhead
Monitoring Your Data
1. Source statistics – data distribution, row counts, etc.
2. Schedule exceptions – load abends, unavailable source or target
3. System statistics – configuration stats, system performance logs
4. Change control – model / process change history
5. Test criteria and scenarios – scripts, data statistics, test performance
6. Meta model – metadata (domain values, operational metadata, etc.)
7. Load statistics – value distribution, row counts, load rates
8. Test / production data statistics – data distribution, row counts, model revisions, refresh history
9. Query performance – end-user query performance and statistics
10. End-user access – who is accessing, when, what query / service is requested, when do they access, what business are they associated with
11. Web logs – monitor external user access and performance
12. End-user feedback – comments, complaints and whines
Monitoring Your Data

(Diagram: a typical analytical environment. Operational source data is extracted, via an ETL tool or extract query, to a staging area, then loaded to the data warehouse and reporting / analytical data marts on the production datamart server, with incremental updates thereafter. A development datamart server, ETL development scripts, and a dev/test area share the storage array over fiber channel connections; a mid-tier environment (web server, document broker, portal security) serves internal analysts and remote end users over the intranet and WWW; an app manager and scheduler drive the ETL process; and a metadata and modeling repository supports the DBA, modeler, and developer. Twelve monitor points mark where source statistics, schedule exceptions, system statistics, change control history, test criteria and results, the metadata model, load stats, test and production data stats, query performance stats, user access stats, web logs, and end-user feedback are captured.)
A Reason for Metadata

(Diagram: metadata spans the entire pipeline – source systems, ODS/staging, data warehouse, data marts/analytics, end-user access, and information distribution.)
Metadata and Monitoring
Metadata provides an objective, criteria-based evaluation of the data from a quality / integrity standpoint.

Metadata provides standards for data use and quality assurance at all levels, from the enterprise to the individual.

Metadata ensures continuity in the data independent of the applications and users accessing it. Applications come and go, but data is forever…

Metadata forces us to understand the data that we are using prior to its use.

Metadata promotes corporate development and retention of data assets.
The Leaky Pipe…
Existing processes & systems:
• Increased processing costs
• Inability to relate customer data
• Poor exception management
• Lost confidence in analytical systems
• Inability to react to time-to-market pressures
• Unclear definitions of the business
• Decreased profits

• It gets worse every day
• Must plug the holes NOW
• Easy ROI justification
Vendor Tools
7. Vendor Tools and Metadata
Vendor Metadata
CASE tools – ERwin, Designer 2000, PowerDesigner
• Technical metadata

RDBMS – Oracle, Informix, DB2
• Technical metadata
• Operational statistics – row counts, domain value deviation, utilization rates, security

ETL
• Transformation mappings
• Exception management
• Recency

BI
• Utilization

ERP
• Source of record
Current Metadata Management
(Diagram: Business Intelligence tools (Brio, Cognos, Business Objects, Oracle Discoverer, MicroStrategy, Hyperion), CASE tools (ERwin, BPwin, Designer 2000, Rational Rose, PowerDesigner), RDBMSs (Oracle, DB2/MF, DB2/UDB, MS SQL Server, Teradata, Informix, Sybase), ERP systems (PeopleSoft, SAP, Oracle Apps), and ETL/EAI tools (Informatica, Ardent, Tibco) each feed one-way into the metadata repository used by the knowledge worker.)

• Reflects data after the fact
• Most are only current-state views, with no history
• No data development and standardization process
• No standards for definitions
Bi-Directional Metadata Management

(Diagram: the same tool categories and vendors, now exchanging metadata in both directions with the metadata repository and the knowledge worker.)
Vendor Strengths in Data Quality
Wrap-up
8. Wrap-up
Wrap-up

Use metadata as part of your data quality effort
• Incomplete metadata is a pay-me-now-or-pay-me-later proposition
Develop statistics around the data distribution, refresh strategy, access, etc.
• Know what your data looks like. Know when it changes.
Use your metadata to answer the who, what, when, where and why about your data.
• Tie your Data Quality Management (DQM) to your Total Quality Management (TQM) to create a TdQM program.
Understand the data distribution in the production environment.
• Understand the statistics about your data.
Publish statistics to a common repository
• Share your data quality standards and reports about the statistics.
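"Know what your data looks like, know when it changes" can be automated by comparing freshly profiled statistics against the published baseline. A sketch, assuming statistics live in a flat dict of named values and using an arbitrary 10% relative tolerance:

```python
def drift(baseline, current, tolerance=0.10):
    """Flag statistics that moved more than `tolerance` (relative)
    away from the published baseline."""
    alerts = []
    for stat, base in baseline.items():
        cur = current.get(stat)
        if cur is None or base == 0:
            continue  # new or degenerate statistic; handle separately
        if abs(cur - base) / abs(base) > tolerance:
            alerts.append((stat, base, cur))
    return alerts

baseline = {"row_count": 100000, "null_rate_zip": 0.02}
current = {"row_count": 101000, "null_rate_zip": 0.09}
print(drift(baseline, current))  # -> [('null_rate_zip', 0.02, 0.09)]
```

The tolerance and the statistic names are illustrative; the idea is simply that a published baseline turns "the data changed" from an anecdote into an alert.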
Summary - Implement
Implement Validation Routines at data collection points.
Implement ETL and Data Quality Tools to automate the continuous detection, cleansing, and monitoring of key files and data flows.
Implement Data Quality Checks. Implement data quality checks or audits at reception points or within ETL processes. Stringent checks should be done at source systems and a data integration hub.
Consolidate Data Collection Points to minimize divergent data entry practices.
Consolidate Shared Data. Use a data warehouse or ODS to physically consolidate data used by multiple applications.
Minimize System Interfaces by (1) backfilling a data warehouse behind multiple independent data marts, (2) merging multiple operational systems or data warehouses, (3) consolidating multiple non-integrated legacy systems by implementing packaged enterprise application software, and/or (4) implementing a data integration hub (see next).
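The first point above, validation routines at data collection points, can start very small: reject or flag records before they propagate. A hedged sketch; the field names and rules are hypothetical, echoing the confidentiality-flag and non-deliverable-address incidents cited earlier in the presentation:

```python
import re

def validate_customer(record):
    """Return a list of validation errors; an empty list means the
    record may enter the pipeline. Field names are illustrative."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    zip_code = record.get("zip", "")
    # US ZIP or ZIP+4; anything else is likely non-deliverable
    if not re.fullmatch(r"\d{5}(-\d{4})?", zip_code):
        errors.append(f"non-deliverable zip: {zip_code!r}")
    # The confidentiality indicator must be explicitly set, never defaulted
    if record.get("confidential") not in (True, False):
        errors.append("confidentiality indicator not set")
    return errors

ok = {"customer_id": "C1", "zip": "94105", "confidential": True}
print(validate_customer(ok))  # -> []
```

Checks like these are cheapest at the point of entry; the same rules can be re-run inside ETL as the stringent second line of defense the slide recommends.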
Summary - Implement
Implement a Data Integration Hub which can minimize system interfaces and provide a single source of clean, integrated data for multiple applications. This hub uses a variety of middleware (e.g. message queues, object request brokers) and transformation processes (ETL, data quality audits) to prepare and distribute data for use by multiple applications.
Implement a Meta Data Repository. Create a repository for managing meta data gleaned from all enterprise systems. The repository should provide a single place for systems analysts and business users to look up definitions of data elements, reports, and business views; trace the lineage of data elements from source to targets; identify data owners and custodians; and examine data quality reports. In addition, enterprise applications, such as a data integration hub or ETL tools, can use this meta data to determine how to clean, transform, or process data in its workflow.
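A repository entry of the kind described might be modeled like this. This is a sketch only: the fields mirror the slide's list (definition, lineage, owners, quality reports), not any specific product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataElement:
    """One metadata repository entry for a data element."""
    name: str
    definition: str
    owner: str          # business owner
    custodian: str      # technical custodian
    lineage: list       # source-to-target steps
    quality_reports: list = field(default_factory=list)

# Illustrative entry; system and element names are hypothetical
elem = DataElement(
    name="CUSTOMER_ID",
    definition="Unique identifier assigned at account creation",
    owner="Sales Ops",
    custodian="DBA team",
    lineage=["CRM.CUST", "ODS.CUSTOMER", "DW.DIM_CUSTOMER"],
)
print(elem.lineage[-1])  # -> DW.DIM_CUSTOMER
```

Even this much structure answers the analyst's everyday questions: what does this element mean, where did it come from, and who do I call about it.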
Some Light Reading…

• Metadata Solutions by Adrienne Tannenbaum
• Improving Data Warehousing and Business Information Quality by Larry English
• The DOD 8320-M standard for data creation and management
• Data Warehousing and the Zachman Framework by W.H. Inmon, John Zachman and John Geiger
• Common Warehouse Metamodel (CWM) Specification
Working with complete attributes…

A vital piece of previously omitted metadata adversely impacts the outcome of the game…
John Murphy – [email protected]
Suzanne Riddell – [email protected]
Touch Points Impact

Operational Systems → Data Warehouse → Data Marts → Repositories

Add, update and retrieve: the same data lives in multiple locations, with multiple touch points.
Quality Assessment Content

Project Information
• Identifier, Name, Manager, Start Date, End Date
Project Metrics
• Reused Data Object Count
• New Data Object Count
• Objects Modified

[Chart: project attributes (New, Reused, Redundant, Updated) for Projects A–D, scale 0–100]
Metadata Strategy

1. Build Data Quality Process
• Establish Data Quality Steering Committee
• Establish Data Stewards
• Establish Metadata Management Process
• Establish Data Development and Certification Process
2. Audit existing metadata resources
• Data models
• ETL applications
• RDBMS schemas
• Collect and certify existing metadata
3. Develop Meta Model
• Determine key metadata sources and alternate sources
4. Develop Metadata Repository and Access Strategy
• Implement Meta Model
• Populate with available as-is metadata
5. Define gaps in the metadata
Using Metadata For Quality

1. Develop the Data Quality Process
2. Implement Data Development and Standardization Process
3. Establish the Metadata Repository
4. Profile and Baseline your Data
5. Use the Metadata to Improve your Data Quality
6. Revise the Data Quality Process
Statistical Analysis

Determine your sample size
• Size needs to be statistically significant
• If in doubt, use a true random 1%
• Repeat the complete process several times to gain confidence and repeatability
• Example:
– N = ((Confidence Level × Est. Standard Deviation) / Bound)²
– N = ((2.575 × 0.330) / 0.11)²
– N ≈ 60 rows
• Use as large a meaningful sample set as possible.
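The sample-size formula from the example, as a small helper (rounding up, since a partial row is not a sample):

```python
import math

def sample_size(z: float, std_dev: float, bound: float) -> int:
    """N = ((z * sigma) / bound)^2, rounded up to whole rows."""
    return math.ceil(((z * std_dev) / bound) ** 2)

# The slide's example: z = 2.575 (99% confidence), sigma = 0.330, bound = 0.11
print(sample_size(2.575, 0.330, 0.11))  # -> 60
```

Tightening the bound or raising the confidence level grows N quadratically, which is why "as large a meaningful sample as possible" is the practical advice.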
What Causes Data Warehouses to Fail

1. Failing to understand the purpose of data warehousing
2. Failing to understand who the real "customers" of the data warehouse are
3. Assuming the source data is "OK" because the operational systems seem to work just fine
4. Not developing an enterprise-focused information architecture, even if only developing a departmental data mart
5. Focusing on performance over information quality in data warehousing
6. Not solving the information quality problems at the source
7. Inappropriate "ownership" of data correction/clean-up processes
8. Not developing effective audit and control processes for the Extract, Correct, Transform and Load (ECTL) processes
9. Misuse of information quality software in the data warehousing processes
10. Failing to exploit this opportunity to "correct" some of the wrongs created by the previous 40 years of bad habits
Metadata Tool Vendors

• Data Advantage – www.dataadvantage.com
• CA Platinum – www.ca.com
• Arkidata – www.arkidata.com
• Sagent – www.sagent.com
• DataFlux – www.dataflux.com
• DataMentors – www.datamentors.com
• Vality – www.vality.com
• Evoke – www.evokesoft.com
Data and Information Quality
2. Quality…
Quality – What it is and is not

Data and information quality is the ability to consistently meet the customer's expectations and to adapt to those expectations as they change.
Quality is a process, not an end point.
Quality is understanding the impact of change and the ability to proactively adapt.
Quality is building adaptable, survivable processes – the less I have to change while keeping my Knowledge Workers' satisfaction high, the more successful I'll be.
Data and information quality is not data cleansing or transformations. By then it's too late.
Quality impacts the costs associated with scrap and rework – just like manufacturing!
The Quality Leaders

• W. Edwards Deming – 14 Points of Quality: moving from "do it fast" to "do it right"
• Philip Crosby – 14-Step Quality Program: determine what is to be delivered, then the timeline
• Malcolm Baldrige – determination of excellence, commitment to change
• Masaaki Imai – Kaizen: continuous process improvement

Quality Frameworks
• Six Sigma – a statistically repeatable approach
• Lean Thinking – simplify to eliminate waste
• ISO 9000 – a quality measurement process
Quality Tools

Six Sigma – a statistically repeatable approach
• Define – Once a project has been selected by management, the team identifies the problem, defines the requirements, and sets an improvement goal.
• Measure – Validate the problem, refine the goal, then establish a baseline to track results.
• Analyze – Identify the potential root causes and validate a hypothesis for corrective action.
• Improve – Develop solutions to the root causes, test the solutions, and measure the impact of the corrective action.
• Control – Establish standard methods and correct problems as needed. The corrective action becomes the new requirement, but additional problems may occur that will have to be adjusted for.
Quality Principles – The Knowledge Worker

IT has a reason to exist: the Knowledge Worker. At Toyota the Knowledge Worker is the "Honored Guest." It's all for the knowledge worker. How well do you know them?
• Who are your knowledge workers?
• What data do they need?
• When do they use your data?
• Where do they access it from?
• Why do they need it to do their job?
Do your KWs feel like honored guests or like cows in the pasture?
Building a profile of the Knowledge Workers:
• Classes of Knowledge Workers – Farmers, Explorers, Inventors
• Determine the distribution of the Knowledge Workers
• Determine their use profiles
User Groups By Data Retrieval Needs

• Grazers (80%) – push reporting
• Explorers (15%) – push with drill-down
• Inventors (5%) – any, all, and then some
Quality Shared – IT and Users

Shared ownership of the data:
• What data do I have?
• How do I care for it?
• What do I want to do with it?
• Where do I / my process add value?
Start with a target:
• Build the car while you're driving it
• Everyone owns the process
• Everyone participates
• Break down the barriers
The Barriers to Quality

Knowledge Workers' gripes about IT:
• IT can't figure out how to get my data on time – I'll do it in Access
• IT has multiple calculations for the same values – I'll correct them by hand
• It takes IT forever to build that table for me – I'll do it in Excel
IT's gripes about Knowledge Workers:
• KWs won't make the time to give us an answer
• What KWs said last month isn't the same as this month
• They are unrealistic in their expectations
• We can't decide that; it's their decision
• I don't think they can understand a data model
Quality Tools

Lean Thinking – simplify to eliminate waste
• Value – Defining what the customer wants. Any characteristic of the product or service that doesn't align with the customers' perception of value is an opportunity to streamline.
• Value Stream – The vehicle for delivering value to the customer: the entire chain of processes that develop, produce and deliver the desired outcome. The lean enterprise tries to streamline the process at every step of the way.
• Flow – Sequencing the value stream (process flow) so as to eliminate any part of the process that doesn't add value.
• Pull – Producing only what is needed, when it's needed. This avoids stockpiling products by producing or providing only what the customer wants, when they want it.
• Perfection – The commitment to continually pursue the ideal: creating value while eliminating waste.
Total Quality data Management

TQdM is a standard data quality process from Larry English.
Process 1 – Assess the Data Definition and Information Architecture Quality
• In – Starting Point
• Out – Technical Data Definition Quality
• Out – Information Groups
• Out – Information Architecture
• Out – Customer Satisfaction
Process 2 – Assess the Information Quality
• In – Technical Data Definition Quality Assessment
• Out – Information Value and Cost Chain
• Out – Information Quality Reports
Process 3 – Measure Non-Quality Information Costs
• In – Outputs from Process 2
• Out – Information Value and Cost Analysis
Total Quality data Management

Process 4 – Re-engineer Data and Data Clean-up
• In – Outputs from Process 3
• Out – Data Defect Identification
• Out – Cleansed Data to Data Warehouse and Marts
Process 5 – Improve Information Process Quality
• In – Production Data, Raw and Clean
• Out – Identified Opportunities for Quality Improvement
Process 6 – Establish Information Quality Environment
• In – All quality issues from Processes 1 through 5
• Out – Management of Processes 1 through 5
Collects much of the metadata already in existence.
Information Quality Improvement Process

P1 – Assess Data Definition and Information Architecture
P2 – Assess Information Quality
P3 – Measure Non-Quality Information Costs
P4 – Re-Engineer and Cleanse Data
P5 – Improve Information Process Quality
P6 – Establish Information Quality Environment