
Data Profiling – A Quick Primer on the What and the Why of Data Integration

AUTHORS

Shankar Ganesh R Senior Technical Architect Architecture and Technology Services HCL Technologies, Chennai

Sathish Kumar Srinivasan Enterprise Data Architect Architecture and Technology Services HCL Technologies, Chennai

Subramanyam B S Lead Researcher ATS-Technical Research HCL Technologies, Chennai


Table of Contents

Introduction
The Need for Data Profiling
Structure Discovery
    Validation with Metadata
    Pattern Matching
    Basic Statistics
Data Discovery
    Standardization
    Frequency Counts and Outliers
    Business Rule Validation
Relationship Discovery
Data Profiling – Typical Opportunity Areas
Data Profiling Tools
Data Profiling in Action – the Banking Sector
Conclusion


Title: Data Profiling – A Quick Primer on the What and the Why of Data Integration

© 2008, HCL Technologies Ltd.

August, 2008


Introduction

In today’s economic environment, businesses are facing increasing pressure to reduce costs. In an effort to remain competitive, companies are looking at all kinds of solutions: Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Supply Chain Management (SCM), Stock Control, Logistics, and Business Intelligence (BI), to name just a few. However, for any solution to deliver value, the data it depends on needs to be accurate, complete, and consistent. In the Global Data Management Survey conducted by Price Waterhouse Coopers, data is considered the asset most fundamental to an organization’s success.1

Despite its importance, most companies do not have detailed information about their data. As a result, the decision to proceed with a solution like ERP, CRM, BI or SCM is fraught with the risk of implementation delays, cost overruns or a lower-than-expected return on investment.

- The Data Warehousing Institute (TDWI) reports that 83 per cent of organizations suffer from problems caused by poor data quality.2
- A Standish Group report indicates that 88 per cent of data integration projects will fail or overrun their target budgets by 66 per cent.3

Companies are also not completely sure of their data quality. In another TDWI survey, half of the respondents said the quality of their data is “excellent” or “good,” while 44 per cent of respondents said that, in reality, the quality of their data is “worse than everyone thinks.”4 Rather than go by the perception of the individuals managing the data, companies need to undertake a data profiling exercise.

1 Price Waterhouse Coopers, P. 18, Global Data Management Survey, 2001, http://sirnet.metamatrix.se/material/SIRNET_10/survey_01.pdf [June 2008; dates in brackets indicate when the site was accessed]
2 TDWI: BI/DW Education Survey Finds 83 Percent of Organizations Suffering from Poor Master Data, http://vendors.ittoolbox.com/profiles/tdwi-dw-professional-education/news/survey-finds-83-percent-of-organizations-suffering-from-poor-master-data-23 [June 2008]
3 Trillium Software, 2004, P. 5, Data Integration and Data Quality Management, http://www.trilliumsoftware.com/site/content/resources/library/pdf_detail.asp?id=49&pdfRecorded=1&type= [June 2008]
4 Ibid, P. 9

83 per cent of organizations suffer from problems caused by poor data quality. Just 12 per cent of data integration projects are completed within their target budgets.


This paper examines the reasons for and the process of data profiling. It also takes a look at data profiling opportunities.

The Need for Data Profiling

A company’s database contains information that touches most aspects of its business activity: market data, customer information, accounting information, production details, sales records, billing details, collection details, personnel records, salary records, and so on. This data is utilized by the company for various business decisions, and it is therefore imperative that the data in the database be consistent, accurate and reliable. Figure 1 shows the factors affecting data quality and the effects of poor data quality.

Figure 1: Causes and Effects of Poor Data Quality

Factors Affecting Data Quality
1. Inadequately articulated requirements
2. Improper data creation process
3. Invalid data structures
4. Duplicate data
5. Redundant data
6. Missing values
7. Incorrect data lengths
8. Data imported from the database of an acquired company
9. Data imported from databases that belong to business partners
10. Unusual values
11. Poor acceptance testing

Effects of Poor Data Quality
1. Increased transaction rework costs
2. Increased costs incurred in implementing new systems
3. Time delays in delivering data to decision makers
4. Business and opportunity costs of lost customers through poor service
5. Costs of lost production through supply chain problems

Poor quality data increases costs, results in time delays and loss of business.


Since the costs of poor data quality are high, companies are increasingly “profiling data” to check its quality and suitability for business. Data profiling uses “analytical techniques to discover the true content, structure, and quality of data.”5 It differs from data analysis in that it derives information about the data, not business information from the data. The purpose of data profiling is to

- Locate instances of inaccurate data
- Determine invalid values, structural violations, and data rule violations
- Find the data characteristics that help a business analyst determine if the data matches the business intent

Typically, data profiling is carried out before data integration is performed, or before business-critical software systems are launched. However, data profiling should be carried out on critical data at regular intervals, to ensure the continuing accuracy of information.

Data Profiling

The data profiling process comprises structure discovery, data discovery and relationship discovery, as shown in Figure 2, and is undertaken before any data-driven initiatives are executed. Data profiling is performed using a tool that

- Automates the discovery process
- Helps uncover the characteristics of the data
- Helps uncover the relationships between data sources

5 See: http://etutorials.org/Misc/data+quality/Preface/ [June 2008]


Figure 2: Data Profiling Workflow

Structure Discovery

Structure problems are caused by data inconsistencies. Some problems are also caused by legacy data sources that are still in use or have been migrated to a new application.

Structure discovery is the process of examining complete columns or tables of data and determining whether the data in those columns or tables is consistent with the expectations for that data. There are three common structure discovery techniques:

- Validation with metadata
- Pattern matching
- Use of basic statistics

Validation with Metadata

Metadata is defined by Wikipedia as "data about data"6 and describes the data that is in a table or column. Metadata contains information that indicates the data type and field length. It also indicates if a field can be missing or null, or if it should be unique. Most data has some associated metadata, or a description of the characteristics of the data.

6 See http://en.wikipedia.org/wiki/Metadata [June 2008]


Metadata validation analyzes the data and indicates, for example, whether or not the field length is appropriate and if there are fields with missing values. Validation also helps determine if the data collected is as per the original plan, or if there are deviations.
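As an illustration, a minimal metadata-validation sketch in Python with pandas is shown below. The column names, sample values, and the expected_metadata dictionary are hypothetical assumptions, not taken from any particular profiling tool.

```python
import pandas as pd

# Hypothetical declared metadata: expected type, maximum length, nullability, uniqueness.
expected_metadata = {
    "customer_id":   {"dtype": "int64",  "max_length": None, "nullable": False, "unique": True},
    "customer_name": {"dtype": "object", "max_length": 60,   "nullable": False, "unique": False},
    "postal_code":   {"dtype": "object", "max_length": 6,    "nullable": True,  "unique": False},
}

df = pd.DataFrame({
    "customer_id":   [101, 102, 102, 104],
    "customer_name": ["Asha Rao", "R. Shankar", None, "S. Kumar"],
    "postal_code":   ["600001", "6000012", None, "600042"],
})

for column, rules in expected_metadata.items():
    series = df[column]
    if str(series.dtype) != rules["dtype"]:
        print(f"{column}: expected type {rules['dtype']}, found {series.dtype}")
    if not rules["nullable"] and series.isnull().any():
        print(f"{column}: {series.isnull().sum()} missing values in a mandatory field")
    if rules["unique"] and series.duplicated().any():
        print(f"{column}: duplicate values found in a field expected to be unique")
    if rules["max_length"] is not None:
        too_long = series.dropna().astype(str).str.len() > rules["max_length"]
        if too_long.any():
            print(f"{column}: {too_long.sum()} values exceed the declared length of {rules['max_length']}")
```

Running the sketch flags the duplicate customer_id, the missing mandatory name, and the over-length postal code, which is the kind of report a metadata-validation step produces.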

Pattern Matching

Pattern matching determines if the data values in a field are consistent across the data source and whether or not the information is in the expected format.7 Pattern matching also checks for other format-specific information about the data, such as type and length.8
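For instance, the telephone-number formats described in footnote 7 could be checked with a simple regular-expression sketch like the one below; the sample values are hypothetical.

```python
import re

# Accepted formats from footnote 7: (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, or nnnnnnnnnn.
phone_pattern = re.compile(r"^(?:\(\+\d{2}\) |\(0\) )?\d{10}$")

sample_phones = ["(+91) 9840012345", "(0) 9840012345", "9840012345", "98400-12345"]

for value in sample_phones:
    if not phone_pattern.match(value):
        print(f"Pattern violation: {value!r} does not match the expected telephone format")
```

Only the last value is reported, since it does not match any of the accepted patterns.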

Basic Statistics

Basic statistics provide a snapshot of an entire data field by presenting statistical information such as minimum and maximum values, mean, median, mode, and standard deviation, to highlight aberrations from normal values.9
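A minimal sketch of such a statistical snapshot, using pandas on a hypothetical order-quantity column; the 10,000-unit value echoes the abnormal order in footnote 9.

```python
import pandas as pd

# Hypothetical order quantities; one value is far outside the usual 500-1,000 range.
orders = pd.Series([520, 640, 750, 750, 980, 10000], name="order_quantity")

print("min:", orders.min())
print("max:", orders.max())
print("mean:", orders.mean())
print("median:", orders.median())
print("mode:", orders.mode().tolist())
print("std dev:", orders.std())
```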

Data Discovery

The second step in the data profiling process is data discovery. Data discovery examines the problem areas indicated by structure discovery by examining individual data elements. Data discovery techniques use

- Matching technology to uncover non-standard data
- Frequency counts and outlier detection to find data elements that don’t make sense

Standardization

Data in an organization comes from different sources: consumers, different departments and partners. Standardization helps discover inconsistencies in the data and then provides a solution to address and fix the inconsistency. For example, HCL, HCLT, HCL Technologies, and HCL Tech all represent the same organization.10 Any report that is generated must account for the way the company is represented, to avoid missing important data points that can affect the output of future processes.

7 For example, a valid mobile telephone number in India could be entered in the database in the format (+NN) nnnnnnnnnn, (0) nnnnnnnnnn, or nnnnnnnnnn, where NN is the numeric code for the country and n is a digit between 0 and 9. If a phone number is entered in a different format, the pattern report will indicate that the telephone number did not match a valid telephone number pattern.
8 Dorr, Brett and Herbert, Pat, P. 4, Data Profiling: Designing the Blueprint for Improved Data Quality, http://www2.sas.com/proceedings/sugi30/102-30.pdf [June 2008]
9 For example, if customer orders range between 500 and 1,000 units, an order of 10,000 units would be considered abnormal and validated prior to its being entered into the system.
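A minimal standardization sketch for the name-variant example above, assuming a hypothetical hand-built lookup of known variants; real data quality tools typically use fuzzy matching rather than a fixed dictionary.

```python
import pandas as pd

# Hypothetical variants of the same organization name mapped to one canonical form.
canonical_names = {
    "HCL": "HCL Technologies",
    "HCLT": "HCL Technologies",
    "HCL TECH": "HCL Technologies",
    "HCL TECHNOLOGIES": "HCL Technologies",
}

suppliers = pd.Series(["HCL", "HCLT", "hcl Tech", "HCL Technologies"])

# Normalize case and whitespace, then map known variants to the canonical name.
standardized = suppliers.str.strip().str.upper().map(canonical_names).fillna(suppliers)
print(standardized.value_counts())
```

After standardization, all four records count toward the same organization, so downstream reports no longer split the company across several spellings.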

Frequency Counts and Outliers

Frequency count looks at how values are related according to data occurrences. An outlier is “an observation that is numerically distant from the rest of the data.”11 Outlier detection examines the data values that are remarkably different from other values.12 In essence, these techniques eliminate the need to validate the entire data set by highlighting the data values that need further investigation.
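For example, a frequency count and a simple interquartile-range outlier check might look like the sketch below; the column name, sample prices, and the 1.5 x IQR threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical unit prices; most cluster together, one is remarkably different.
prices = pd.Series([110, 120, 120, 125, 130, 135, 4000], name="unit_price")

# Frequency count: how often each value occurs.
print(prices.value_counts())

# Simple outlier check using the interquartile range (IQR).
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print("Values needing further investigation:", outliers.tolist())
```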

Business Rule Validation

A business rule defines or constrains one aspect of your business and is intended to influence the behavior of your business.13 Data profiling software does not include business rules, since business rules are specific to each organization. However, a robust data profiling process must be able to build, store, and validate against an organization’s unique business rules.
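As a sketch, a rule such as “a closed account must carry a zero balance” (the example used later in the banking case study) could be validated as shown below; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical account records.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "is_closed":  [True, False, True, False],
    "balance":    [0.0, 250.0, 75.5, 0.0],
})

# Business rule: closed accounts must have a zero balance.
violations = accounts[(accounts["is_closed"]) & (accounts["balance"] > 0)]
print(violations)
```

Only the records that break the rule are returned, which is exactly the exception list a business-rule validation step feeds back to data owners.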

Relationship Discovery

Relationship discovery is the third part of the data profiling process and provides information about the ways in which data records inter-relate. These data records can be multiple records in the same data file, records across data files or records across databases.14

Relationship discovery (see the sketch after this list)

- Determines key relationships by using metadata, if available
- Checks the relationships for the provision of a unique primary key or a foreign key
- Inspects the records that prevent a key from being unique
- Identifies outstanding records that do not adhere to the relationship
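A minimal sketch of these key checks, assuming two hypothetical tables held as pandas DataFrames.

```python
import pandas as pd

# Hypothetical parent and child tables.
customers = pd.DataFrame({"customer_id": [1, 2, 2, 4]})          # candidate primary key
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 4, 7]})                 # foreign key to customers

# Records that prevent customer_id from being a unique primary key.
duplicate_keys = customers[customers["customer_id"].duplicated(keep=False)]
print("Duplicate key records:\n", duplicate_keys)

# Outstanding records that do not adhere to the relationship (orphan foreign keys).
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print("Orders without a matching customer:\n", orphans)
```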

10 Ibid 6
11 See http://en.wikipedia.org/wiki/Outlier [June 2008]
12 Outlier Detection, http://www.dataflux.com/technology/methodology/data-profiling/outlier-detection.asp [June 2008]
13 Business Rules, http://www.agilemodeling.com/artifacts/businessRule.htm [June 2008]
14 Ibid 8


Data Profiling – Typical Opportunity Areas

Many companies still do not have a single view of the customer. Having a single view enables a company to

- Obtain a precise understanding of all the business that the company is conducting with customers, across multiple units and product lines
- Identify cross-selling opportunities
- Create targeted marketing campaigns

Some examples of data profiling are given below.

Example 1

Supply chains depend upon effective procurement processes and accurate procurement information. A single database that contains details about suppliers and the items that they sell increases efficiency.

Data profiling is useful in integrating supplier details and information about the items that suppliers sell, helping to improve immediate efficiencies and to facilitate the consolidation and integration of different processes and systems.

Example 2

Data repetition in data sources is common. For example, in banking, insurance or retail, an account holder’s name can be recorded as FirstName MiddleName LastName; FirstName M LastName; or F M LastName, and so on.

Data profiling traces and removes such repetitions to improve data quality and enhance business intelligence, thereby enabling a better customer experience and profitability.

Example 3

Databases provide assorted customer-related information, such as the types of products sold, product profitability and customer profitability. Critical business decisions depend on the accuracy of the information in these databases.

Data profiling provides a single view of the customer. It helps understand the gamut of company-customer transactions, identify cross-selling opportunities and create targeted marketing campaigns.


For example, a credit card company or a telecom company can use data profiling to create customer profiles. These customer profiles could help the company customize products for specific individuals or groups. Information in the customer profile about the individual’s payment behavior enables the company to monitor its overall risk portfolio and enhance an individual’s credit limit.

Data Profiling Tools

Data profiling is generally done using specific software tools designed for the purpose, rather than statistical tools. Table 1 compares statistical tools with data profiling tools and illustrates the advantages of using data profiling tools.

Statistical Tools | Data Profiling Tools
Must formulate a large number of queries and/or reports in order to test rules against the data | Addresses all the stages of data profiling
Execution is slow since rules are executed serially | Processes a large amount of data in a short period of time
Cannot discover rules, and users do not understand the actual structure or content of the data without discovery | Includes discovery processes
Use of validation processes alone will result in not discovering issues | Includes automatic discovery and validation processes

Table 1: Statistical Tools vs. Data Profiling Tools

An effective data profiling tool addresses the following three phases:15

- Initial profiling and data assessment
- Integration of profiling into automated processes
- Passing profiling results to data quality and data integration processes

The data profiling solution16 should also aid in constructing data correction, validation, and verification routines directly from the profiling reports.

15 P. 9, Ibid 8
16 Some well-known data profiling tools are Trillium Software from Harte-Hanks, DataFlux from DataFlux Corporation, Data Insight XI from Business Objects, Information Analyzer from IBM and Data Explorer from Informatica. For more information you can visit http://mediaproducts.gartner.com/reprints/businessobjects/149359.html [August 2008]


Data Profiling in Action – the Banking Sector

One of the largest banks in Singapore, with an asset base of about US$93 billion, needed to compute the amount of capital it required to guard against financial and operational risks (Basel 2 norms).

Banks base their capital adequacy calculation on information such as

- Elapsed days since the last scheduled payment on a loan by a customer
- The customer’s payment behavior, which is based on information in his/her account

A bank’s capital adequacy assessment should tally with the financial data submitted to regulatory bodies.

Our client realized that the accuracy of its projections was dependent on the quality of its source data, and therefore decided to get its source data profiled.

Solution architects from HCL carried out a two-stage data profiling process to determine data quality.

I. Analysis Stage

- Determine if individual values are valid values for a column
- Analyze column values to discover problems with uniqueness rules and unexpected frequencies of specific values
- Analyze structure rules that govern functional dependencies, primary keys, foreign keys, synonyms and duplicate columns

II. Validation Stage

- Validate data rules to ensure that they hold true for
  - A row of data
  - All rows for a single business object
  - A collection of business objects
  - Different types of business objects



As shown in Table 2, HCL architects ran a number of quality checks in the data profiling process.

Quality Check | Example
Domain checking | Gender fields should have a value of either M or F
Range checking | For age, the value should be less than 125 and greater than 0
Referential integrity | If an order shows that a customer bought product X, make sure there actually is a product named X
Basic statistics, frequencies, ranges and outliers | If a company has products that cost between $100 and $1,000, flag any that fall outside this range
Uniqueness and missing value validation | If a code is supposed to be unique, make sure it is not being reused
Key identification | If there is a defined primary key/foreign key relationship across tables, validate it by looking for records that do not have their corresponding related record
Data rule compliance | If closed credit accounts must have a zero balance, make sure there are no records where the closed account flag is true and the account balance is greater than zero
Basic statistics about the data | Minimum value; maximum value; mean; mode; standard deviation; minimum field length; maximum field length; occurrences of null values in key-defined fields; frequency distribution, including candidate columns for multi-value compression; invalid data formats

Table 2: Data Profiling Quality Checks
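To make Table 2 concrete, the sketch below shows how a few of these checks (domain, range, and referential integrity) might be expressed in pandas; the column names and sample data are hypothetical, not the bank’s actual schema.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["M", "F", "X"],      # domain check: only M or F expected
    "age": [34, 150, 29],           # range check: 0 < age < 125 expected
})
products = pd.DataFrame({"product_code": ["A100", "B200"]})
orders = pd.DataFrame({"order_id": [10, 11], "product_code": ["A100", "C300"]})

# Domain checking: gender must be M or F.
print(customers[~customers["gender"].isin(["M", "F"])])

# Range checking: age must be greater than 0 and less than 125.
print(customers[~customers["age"].between(1, 124)])

# Referential integrity: every ordered product must exist in the product table.
print(orders[~orders["product_code"].isin(products["product_code"])])
```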

HCL developed a proof-of-concept to evaluate the advantages of using in-house tools vis-à-vis third-party data profiling tools, before settling on Teradata17 Profiler.

17 Teradata is a registered trademark of Teradata Corporation.


HCL’s approach to the data profiling process comprised the following:

- Create a data profiling plan for each source system
- Formulate the business rules for
  - Checking data quality
  - Handling exceptional records
- Implement the data profiling plan

Figure 3 shows the data profiling process using Teradata Profiler.

Data profiling enabled our client to gain a competitive advantage by

- Being among the first banks to implement Basel 2 norms
- Generating a trusted, accurate, reliable and standardized customer and banking data source
- Reducing business risks
- Paving the way for the successful implementation of Master Data Management (MDM) and SOA projects
- Improving the bank’s ability to satisfy compliance requirements

Figure 3: Data Profiling Flow (components shown: source data, DataStage extract, staging area, Teradata Profiler with business rules, profiling report, corrective action on source systems, Teradata load utilities, and the enterprise data warehouse)

The increased reliability of source data brought about by data profiling paves the way for implementing projects such as MDM and SOA.


Conclusion

Databases in most companies have evolved in an ad-hoc manner, which has resulted in information silos. Companies therefore do not have a unified view of their customers, resulting in missed business opportunities or increased operating costs. Data integration addresses those issues, but poses data verification challenges, since the source data resides in diverse databases. Most data integration and migration projects overshoot their time and cost estimates because of the effort expended to understand the source data.

Data profiling automates the identification of problematic source data: inconsistencies, redundancies, and inaccuracies. Data profiling also provides a factual foundation on which data can be cleansed and then consolidated before integration. Some of the benefits of the integrated, accurate and validated data that results from data profiling are:

- Enhanced accuracy of accounts receivable, resulting in increased debt collection
- Better customer service
- Cross-selling
- Focused brand marketing campaigns
- Reduced operational costs
- Fraud detection
- Compliance with regulations

Since these are tangible benefits to a company, irrespective of the sector in which it operates, data profiling is relevant to companies and consultants alike.

Data quality management initiatives pay for themselves in commercial terms. … some respondents report up to ten times payback on the investment involved – P. 25, Price Waterhouse Cooper Global Data Management Survey, 2004