mdm server suspect duplicate processing

IBM® InfoSphere™ Master Data Management Server

Best Practices

InfoSphere MDM Server

Suspect Duplicate Processing (SDP)

John Baldwin, MDM Product Consultant

Stephanie Hazlewood, MDM Product Architect

Charles Jia, MDM Product Consultant

Lena Woolf, MDM Product Architect

IBM

InfoSphere MDM Server Suspect Duplicate Processing

Executive Summary

Many established organizations end up having unmanaged master data. It might be the result of mergers and acquisitions, or it might be due to the independent maintenance of information repositories that are siloed by line of business (LOB) information. In either situation, the result is the same: useful information that could be shared and consistently maintained is not. Unmanaged master data leads to data inconsistency and inaccuracy. IBM® InfoSphere™ Master Data Management Server (InfoSphere MDM Server) enables the enterprise to create a single, trusted record for a party (that is, a person or an organization). It has similar capabilities for mastering product data and account data whilst having the ability to master “content” through integration with content management systems.

Suspect duplicate processing is the process for searching, matching and creating associations, or suspects, between existing parties in the system. These associations are made based on how well the parties match and at times the decision may be made to merge these suspects into what has been termed a “single view” or the “golden record” for a party in InfoSphere MDM Server.

The purpose of this document is to provide considerations, recommendations and strategies that you can use to put an effective suspect duplicate processing (SDP) solution in place. The SDP of party domain information is the focus of this particular document.


Table of Contents

InfoSphere MDM Server Suspect Duplicate Processing (SDP)
Executive Summary
Understanding Suspect Duplicate Processing (SDP)
The IBM Data Governance Framework
Data Governance Components in the Context of SDP
The SDP Solution: Implementation Tasks
The SDP Solution: Roles
Understanding Your Data
Planning for Candidate Pool Search and Matching: Data Profiling and Selection of Critical Data Elements
Increasing Accuracy: Data Standardization
Candidate Selection Process (Suspect Search)
Matching
Survivorship
Planning for Survivorship Processing
Merging Guaranteed Duplicate Parties
Persisting Guaranteed Duplicate Parties
Data Survivorship
Data Load Strategies
Typical Recommended Approaches to Loading Data
Adding New External System After Initial Load
Best Practices
Conclusion
Further reading
Contributors
Appendix A: Configuration Tips for the Data Standardization Feature
Appendix B: Quick Steps for Customizing the Criteria for Persisting Duplicates
Notices
Trademarks


Understanding Suspect Duplicate Processing (SDP)

InfoSphere MDM Server provides methods for maintaining the quality of your party data, by providing services that maintain a single and accurate record of a party across your enterprise.

Although maintaining high data quality is critical to an MDM solution, it is also very important that an enterprise consider the orchestration of how its people, processes, and technology will all work together to manage data assets as a whole. Employing the SDP functionality of MDM Server is but one piece of your larger data governance strategy, and we present the SDP feature in this document with that in mind.

As depicted in Figure 1, the “golden record” for a party is built from duplicate records provided by numerous source systems. It is sometimes referred to as the “golden master”: the single version of the truth for a particular customer. Suspect duplicate processing (SDP) is the process of searching for, matching, and creating associations between existing duplicate parties (the suspects) in the system, and at times merging their party data.

Figure 1: Building the golden record using suspect duplicate processing

[Figure: source systems feed InfoSphere MDM Server. The diagram labels five steps: data in source systems is profiled and analyzed; data is loaded into MDM Server and cleansed; suspected duplicates are searched and matched; suspects are linked, and the golden record could be auto-created; data stewards review suspects and create the golden record. Participants shown: business analysts, SMEs, and the data steward team.]


SDP can be an involved process, despite seeming straightforward on the surface. The complexity of the SDP process is driven by the fact that many organizations lack trusted information. Information is often incomplete, out of date, and at times inaccurate. So before you can determine how to find the duplicates in your enterprise, you need to figure out the topology of the data in the systems that will be contributing to your master data store. This can be done by profiling and analyzing the systems in order to better understand the nature of the data.

This best practices document does not focus on best practices for data discovery or profiling technologies, but we do stress the importance of engaging in these activities as part of your SDP implementation. You might consider using products such as InfoSphere Information Analyzer and InfoSphere Discovery, and additional reading on these products can be found in the “Further Reading” section of this document.

An important part of many Master Data solutions involves loading the data from existing source systems into a master data repository. In this process it is typical to cleanse and standardize the data so that what remains can be better searched and matched on. There are a number of options that address the question of when you should actually de-duplicate your data, and we address some of the options in the section titled, “Data Load Strategies.”

The actual suspect de-duplication process itself involves three broad tasks: searching, matching, and survivorship. The search process is carried out to build the suspect pool, or candidate list. These are the parties that will then be put through the matching and adjustment process to determine which suspect categorization they belong to. If there is a close enough match between two parties, the party will be allocated a possible duplicate score and suspect categorization, which is a product of the matching and non-matching values of their critical data elements. A suspect record will be created in many cases, and in the case of a guaranteed match, the information might simply be merged thereby forming the updated single, golden record.

The suspect duplicate processing (SDP) feature of InfoSphere MDM Server is very flexible and can be modified and configured extensively. Search, match, and survivorship rules are all available to be customized to meet the business requirements of your solution. This document does not delve into line-by-line detail of how to customize each SDP rule. For detailed information on this particular topic, consult the InfoSphere MDM Server Developer’s Guide.


The IBM Data Governance Framework

Data Governance is a discipline that embodies the execution and enforcement of authority over the management of data assets and the performance of data functions. It is the orchestration of people, process, and technology to enable an organization to use data as an enterprise asset. A data governance process focuses on activities and tools used across the enterprise to ensure the consistent movement, management, and modification of data.

Data Governance Components in the Context of SDP

When resolving duplicates as part of suspect duplicate processing, decide what criteria (and information within or outside of the application) are used to resolve duplicates and survive data appropriately.

Policies

Policies and standards allow for and support the adoption and enforcement of Data Governance practices, roles, and responsibilities.

Procedures

This activity identifies and uses existing processes as a base when developing formal procedures. Procedures are the detailed tasks. They should be included as part of process documentation.

For SDP there should be procedures that guide the following activities:

• Ensuring suspect records are regularly reviewed.

• Ensuring consistent decisions are being made about how suspect records get managed or resolved.

Data Quality

This activity considers the degree to which an organization understands, defines, and manages data quality across the enterprise. Master data assets should achieve and sustain high levels of quality as the system of record for the enterprise.

Security

The Security / Privacy / Compliance activity considers the degree to which an organization has put in place policies, processes, and technologies to protect its data from misuse.


The SDP Solution: Implementation Tasks

The following is a brief description of the tasks that must be performed to successfully implement suspect duplicate processing using InfoSphere MDM Server. These tasks are discussed in greater detail in the remainder of this document.

Decide which systems will provide or consume SDP-processed information.

Typically, InfoSphere MDM Server implementations involve many systems. Some may be sources of party data. Other consuming systems need to be fed information from InfoSphere MDM Server. It is important that every system involved in either providing or consuming master data has undergone analysis to have its suitability considered, from a business and technical perspective. The results of this analysis should indicate whether each system will accept the party information that results from SDP or treat it as an exception.

Conduct data profiling and data quality analysis for all participating source systems.

Although it is important that you understand the kind of data stored in your source systems and how it is being used, it is also critical to be familiar with the quality of this data. Perform an analysis of the data on all source systems involved to produce a data profile for each. As an example, if fields are inconsistently populated, the SDP search and matching processes will not produce the best results using their standard critical data element configuration. If date of birth (DOB) is a high-scoring critical data element but is rarely provided in a source system, you might miss potential duplicates, because perfect matches with other suspect records are less likely to be found on this criterion. More information about how to select critical data elements is provided later in this document.

Choose a matching approach.

A good understanding of the systems above and the availability of data from them, coupled with knowing the requirements of the business, is needed to make decisions on which matching technique to use and at what point in the process to perform the matching. What is the tolerance of the business for missed duplicates or falsely identified duplicates (false positives)? Based on this level of tolerance, how will you manage the volume of matches identified? These are the kinds of questions to consider when choosing an approach to matching with InfoSphere MDM Server.

Determine how you will resolve duplicates.

When possible duplicates are identified, someone has to decide how to resolve these party records. If a guaranteed match is found, it is common practice to allow InfoSphere MDM Server to automatically merge these records. Similarly, if there are significant differences in the data, it is typical for InfoSphere MDM Server to ignore these records and not consider them suspects of each other. Records that fall into neither of these categories generally require human intervention (processing by data stewards) to determine whether the parties are indeed duplicates. InfoSphere MDM Server provides a Data Stewardship User Interface (DSUI) that facilitates this processing.


People who communicate directly with customers in the course of business might also add or update records online. In these cases, an add or an update function might trigger the creation of suspects. It might be possible to resolve these cases directly with the customer as they happen.

Select your approach to surviving data.

Once matches have been found, will you merge together the information for guaranteed duplicate parties? If you merge the records, will you always merge the information? Is there a requirement to maintain duplicates between lines of business? Survivorship is the process that can take care of the remediation of overlapping and sometimes inconsistent information. By answering these and other questions, you will determine how you need to survive the data.

Choose an approach to initial and delta loading of source system data.

Consider how you will load data into InfoSphere MDM Server. These decisions are generally driven by your non-functional requirements including the volume of data to be loaded, the time allocated for the load process, and the technical capabilities of the source systems in place.

Select an approach to synchronizing your master data with source systems.

If you are merging guaranteed duplicate parties, the physically persisted consolidated record becomes the “golden record” for a party. The information in this record might need to be fed back to the source systems or published to downstream systems, or both. The need to synchronize the “golden record” back to source systems will really depend on the implementation style of your MDM solution. We discuss some common implementation styles in the section titled “Survivorship” in this document.

During the design of your suspect processing solution, take care to understand the implications of running SDP on the data provided by your source systems. When you need to synchronize information with source systems you will need to understand the underlying data models for all systems involved and a mapping activity will be required to ensure the data gets propagated appropriately.

The SDP Solution: Roles

These role descriptions are very generic and are likely to differ from one enterprise to the next, but they give a starting point for describing the responsibilities and tasks that need to be addressed. Each role has a brief description, followed by a rough assignment of the tasks above that the role would be expected either to carry out or to collaborate in.


Business Analysts / Subject Matter Experts

Business analysts and subject matter experts have a very good understanding of how InfoSphere MDM Server and especially SDP can work. They also have the knowledge of what the enterprise wants and any constraints or business rules covering the processing of customer data.

Involved in Tasks:

• Decide which systems will provide/consume SDP-processed information

• Conduct data profiling and data quality analysis for all participating source systems

• Choose a matching approach

• Select your approach to surviving data when possible duplicates are found

• Choose an approach to initial and delta loading of source system data

• Select an approach to synchronizing your master data with consuming systems

Data Stewards

Data Stewardship activities address the degree to which an organization defines, organizes, and manages its Information Technology (IT) and business data governance roles and responsibilities. Data Stewards are accountable for making the decision to collapse suspected duplicate records or not. Data stewards work with others across IT and the business to validate SDP designs, particularly around match and survivorship criteria as well as data definitions.

Involved in Tasks:

• Choose a matching approach

o Specifically, assist in the selection of critical data elements and match scoring if a deterministic approach is taken.

• Select your approach to surviving data when possible duplicates are found

o Specifically, engage in examining suspect records and performing necessary resolution of suspects.

Customer Service Representatives

Customer Service Representatives (CSRs) are the people who are in direct contact with customers from a service or sales point of view. In the course of their work, they commonly add and update records for a party. These activities might lead to suspects being identified, either by SDP or through direct interaction with a customer, and might result in suspects being resolved, possibly even on the spot.

Involved in Tasks:

• Select your approach to surviving data when possible duplicates are found

o Specifically, engage in examining suspect records and performing necessary resolution of suspects.


Understanding Your Data

Paving the way for a successful InfoSphere MDM Server project starts with understanding the information you already have. Specifically, you should confirm your understanding and learn about the data being stored in any source systems that will participate in the solution. Before attempting to define the business rules for matching duplicate data across systems, organizations should engage in discovery, profiling, and analysis of their data sources in order to understand the content, quality, and structure of data pertaining to person or organization.

Identify all data sources that will participate in your InfoSphere MDM Server solution. Understand the data they contain and note any interrelationships the data has across systems, any cross-source transformation logic, and existing business rules.

Planning for Candidate Pool Search and Matching: Data Profiling and Selection of Critical Data Elements

Not every data point is important from the suspect duplicate processing point of view. You must identify the set of critical data elements that best help to uniquely identify the person or organization according to your business. This set of critical data elements will be used to make matching decisions. For example, a person could be identified by name, date of birth, address, gender, or social security number (SSN). However, if gender data is missing in one of the source systems or is defaulted to “unknown”, then gender should not be considered a critical data element. As you can see, the process of selecting the set of critical data elements is highly dependent on the quality of the data in source systems.

We recommend taking an iterative approach when selecting your critical data elements:

• Identify the set of critical data elements based on business knowledge and existing business processes.

• Perform data profiling for the selected set of critical data elements.

• Analyze the quality of the data, keeping in mind its purpose: this data will be used for the duplicate matching process.

• If low quality data is found, adjust the critical data element selection and perform the data analysis again.


Make sure to check the following data integrity aspects:

• Conformity to metadata (such as field length and mandatory flags). For example, Social Security Number may be mandatory in some organizations and in these cases it is always expected to be populated.

• Conformity to the expected data patterns or formats. For example, the Social Security Number is expected to contain nine numerals and should not contain letters or other symbols.

• Conformity to the expected data values. For example, the Social Security Number is expected to be mostly unique within the same source system. A large frequency of occurrence of the same SSN is an indication of data corruption and warrants further cleansing of data before it is suitable to be used as critical data element for duplicate matching.

• Identification of multiple components (or mixed domains) within a single field that might make understanding or accuracy difficult to achieve. For example, an address field that also contains an individual’s name.

• Look for any outlier cardinalities between business entities. For example, on average a party may have two addresses, but there may be some parties with hundreds of addresses.
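The integrity checks above can be sketched in a few lines of code. The following is a minimal illustration only, not a substitute for a profiling product: the function name `profile_ssn`, the metric names, and the sample values are all hypothetical, and the nine-numeral rule is taken from the SSN example in the list.

```python
import re
from collections import Counter

# Expected pattern from the text: nine numerals, no letters or other symbols.
SSN_PATTERN = re.compile(r"[0-9]{9}")

def profile_ssn(values):
    """Return simple quality metrics for a column of SSN values.

    `values` is a list of strings; None or "" means "not populated".
    """
    total = len(values)
    populated = [v for v in values if v]
    well_formed = [v for v in populated if SSN_PATTERN.fullmatch(v)]
    counts = Counter(well_formed)
    # An SSN shared by several records hints at defaulted or corrupted data.
    repeated = {v: n for v, n in counts.items() if n > 1}
    return {
        "fill_rate": len(populated) / total if total else 0.0,
        "format_rate": len(well_formed) / len(populated) if populated else 0.0,
        "repeated_values": repeated,
    }

# Hypothetical extract from one source system.
sample = ["123456789", "123456789", "987654321", "12-345-678", None, ""]
report = profile_ssn(sample)
print(report)
```

A low fill rate or format rate, or a long list of repeated values, would be a signal to cleanse the field or to reconsider it as a critical data element.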

Data profiling should be more than just an up-front source analysis activity. It is important to continuously monitor the quality of the data in the source systems to ensure that business decisions made during critical data element selection process are still valid.


This article will not go into details on how to perform data discovery and quality analysis. IBM offers a full spectrum of tools and technologies that address data quality issues. More specifically:

• IBM InfoSphere Discovery uses algorithms and a patented technology that allows you to build a 360 degree view of how your data sources are interrelated. Using these capabilities, InfoSphere Discovery identifies and documents what data you have, where it is located, and how it is linked across systems by intelligently capturing relationships and determining applied transformations and business rules.

• IBM InfoSphere Information Analyzer is a solution for ongoing data quality assessment, analysis, and continuous monitoring and exception management. Information Analyzer is focused on performing a thorough source analysis to uncover data issues, such as duplication, incompleteness, invalid values, format violations, and so on, and to ensure that data follows de-facto and documented business rules. It identifies anomalies that violate those business rules on an ongoing basis.


• IBM InfoSphere QualityStage complements IBM InfoSphere Information Analyzer by investigating free-form text fields such as names, addresses, and descriptions. InfoSphere QualityStage allows you to define rules for standardizing free-form text domains which is essential for effective probabilistic matching of potentially duplicate master data records.

Increasing Accuracy: Data Standardization

Data profiling typically uncovers inconsistencies in free-form data across multiple source systems. For example, one source system maintains separate First Name and Last Name fields for person name while the other stores it as a single, free-form Person Name field. Data inconsistency can also exist even within the scope of the single source system if data is not standardized on input.

It is important to standardize free-form data that are designated as critical data elements to ensure consistent results for candidate selection and matching process. Typically, name, address, phone number, and identifier data fields require standardization.

InfoSphere MDM Server invokes standardization whenever name, address, and phone number information is added or updated for a person or organization. There are three standardizers available with InfoSphere MDM Server: the Default Standardizer, QualityStage, and Trillium. For a given installation of the product, only one of these can be configured; you cannot mix and match them.

There are different approaches to standardizing data as part of InfoSphere MDM Server, but they all have one common goal – to ensure that critical data elements from multiple sources are standardized consistently before invoking the de-duplication process.

Let’s consider the following examples:

1. An InfoSphere MDM Server solution provides a master view of customer data across three legacy source systems by consolidating their information into InfoSphere MDM Server by using a batch feed.

a. The data from the batch feed could be standardized by QualityStage or Trillium during the execution of InfoSphere MDM Server runtime services invoked by the batch load process. Runtime standardization will ensure that names, addresses, and phone numbers are stored in InfoSphere MDM Server using the same format.

b. Alternatively, the data extracts from source systems could be standardized in a staging area before being loaded into InfoSphere MDM Server by using the batch load process. To avoid too much performance overhead, the batch input data can be configured to skip the runtime standardization of addresses as it has already been done in the staging area.

c. Specifically for initial load, the data extracts from the source systems could be standardized as part of a direct initial load process that bypasses InfoSphere MDM Server runtime services. It is important to ensure that the standardization logic is the same between the direct initial load and the runtime services used for delta batch processing.

2. Let’s imagine that the InfoSphere MDM Server solution also supports a new online Customer Service Representative (CSR) system that makes use of InfoSphere MDM Server runtime services.

a. The data from a CSR service request will be standardized by the InfoSphere MDM Server runtime services to ensure that names, addresses, and phone numbers are stored in InfoSphere MDM Server using the same format.

b. If the source system batch feed data was standardized in a staging area (or by a direct initial load process), it is important to ensure that the standardization logic is the same between the runtime and staging area.
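The requirement that staging-area and runtime standardization apply the same logic can be illustrated with a toy example. This is only a sketch under stated assumptions: the functions `standardize_name`, `standardize_phone`, and `standardize_party` are hypothetical and far simpler than any of the standardizers named above. The point is that both load paths call one shared set of rules.

```python
import re

def standardize_phone(raw):
    """Keep digits only, so differently punctuated numbers compare equal."""
    return re.sub(r"[^0-9]", "", raw or "")

def standardize_name(raw):
    """Trim, collapse repeated whitespace, and upper-case a free-form name."""
    return " ".join((raw or "").split()).upper()

def standardize_party(record):
    """Apply the same rules regardless of which path delivers the record."""
    return {
        "name": standardize_name(record.get("name")),
        "phone": standardize_phone(record.get("phone")),
    }

# Hypothetical records: one from a batch extract, one from an online request.
batch_record = {"name": "  jane   doe ", "phone": "(416) 555-0100"}
online_record = {"name": "Jane Doe", "phone": "416.555.0100"}

standardized_batch = standardize_party(batch_record)
standardized_online = standardize_party(online_record)
print(standardized_batch == standardized_online)
```

If the two paths used different rules (for example, one keeping punctuation in phone numbers), the same party could be stored in two formats, and later search and match steps would be less likely to bring the records together.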

Using Phonetic Keys with Standardization

In addition to standardizing name and address, InfoSphere MDM Server also generates and stores phonetic keys to consider phonetic variations of given names, street names, and city names when searching for parties as part of the candidate selection process. Using a phonetic key rather than its corresponding column will generally increase the size of the suspect candidate pool.

InfoSphere MDM Server supports the NYSIIS and Soundex phonetic generators. The configured default phonetic key generator uses the Soundex algorithm, which is one industry standard. You can customize the generation of phonetic keys to plug in a custom phonetic key generator or to store the phonetic keys produced by a standardizer.
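To show why searching on a phonetic key widens the candidate pool, here is a minimal sketch of the widely documented American Soundex rules. It is an illustration only; the key generator shipped with the product may differ in its details, and the function name `soundex` is chosen for this example.

```python
# Letter groups of American Soundex; vowels, H, W, and Y carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex key for a name ("" if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and Y yield "" and reset `prev`
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]  # pad with zeros, truncate to four characters

for given in ("Smith", "Smyth", "Robert", "Rupert"):
    print(given, soundex(given))
```

Because "Smith" and "Smyth" share one key, a search on the key column returns both spellings, where a search on the name column would return only exact matches.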

Given the nature of standardization and phonetics, only certain languages are supported. We recommend modifying the default standardization and phonetic key generator components to take into account regional nuances of handling names and addresses.

Candidate Selection Process (Suspect Search)

Whenever person or organization data is added or updated in InfoSphere MDM Server, a search process is carried out to build the suspect pool, or a candidate list of parties that might be suspects. These are the parties that will then be put through the detailed matching and adjustment process to determine to which suspect categorization they belong.

Searching is done using a subset, and combinations of that subset, of the critical data elements that you selected as being able to best uniquely identify a person or organization according to the nature of the information that is stored in your various business systems. Again, we see the importance of the data profiling and standardization! It is typical to customize the party search rule to take into account any differences from the critical data elements that get used by the search process provided by InfoSphere MDM Server out-of-the-box.

Customize your suspect search processing to ensure that your search for suspects performs well. For example, you do not want to unnecessarily retrieve detailed information for parties that are unlikely to be deemed matches later in the match process. A search process that has not been tuned might yield a huge suspect pool depending on the data volume in the system being searched and the criteria used in the search rule.

You should consider the following best practices when tuning the suspect duplicate processing candidate search rule to block your data:

• Align your search strategy with the match processing that you have chosen. For example, if you choose a probabilistic approach to matching, you will want to tune your suspect search accordingly.

o If your search criteria will yield too many candidates, there is a good chance that many will not end up being matches. This results in unnecessary processing and system performance degradation.

o If your search criteria will yield too few candidates you may end up with missed matches in the system.

• Create your search rule in an iterative manner.

• Search by name and address combinations

• When searching using address, group parts of the address into smaller searchable components. Some examples include:

o Street name and postal code

o Box number and postal code

o Rural route and postal code

o Street name and city

o Box number and city

o Rural route and city

• Modify the party search rule to accommodate the subset of critical data elements you wish to use from the set identified as the result of your data profiling activities.

• Search by party identification type and number


• Search by party contact method type and reference number

• Add only active parties to your resulting suspect candidate pool

• If certain party data elements are not being populated, don’t retrieve them!

• Choose between an iterative search implemented in Java and a single SQL statement, and consider the performance impact of each approach.

o Using Java involves executing a number of queries against the MDM repository, each built from one of the combinations of search criteria deemed important to identify parties in the system. The results of these queries are then merged to form the final candidate pool.

o Using SQL, a single query statement that combines (by UNION) the combinations of search criteria deemed important to identify parties in the system is executed once against the repository.

In general, the more search iterations you need to build your suspect pool accurately, the more you stand to gain from the latter, single SQL statement approach. It is worth prototyping your search algorithm using each approach to determine the implementation that performs best given the criteria and blocking methods.

• Optimize database indexes

o When changing the search criteria, use a database utility tool to optimize your SQL statements and generate indexes based on the optimized SQL. Applying these to the database will improve the overall performance of your search.

• If your InfoSphere MDM Server implementation uses only part of the data model, turn off the unused parts. You can use the Smart Inquiries feature of InfoSphere MDM Server to avoid making unnecessary queries to these empty or unused parts of the data model. See the InfoSphere MDM Server Developer’s Guide for detailed guidance on the configuration of this feature.
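The two search styles described above can be compared with a small prototype. The following is a minimal sketch only, using Python and an in-memory SQLite database. The `party` table, its columns, and the sample rows are hypothetical and do not reflect the InfoSphere MDM Server data model or its search rule API.

```python
import sqlite3

# Hypothetical, simplified schema; not the InfoSphere MDM Server data model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE party (id INTEGER, last_name TEXT, street TEXT, "
    "postal_code TEXT, city TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO party VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "SMITH", "MAIN ST", "M5V 1A1", "TORONTO", 1),
        (2, "SMYTH", "MAIN ST", "M5V 1A1", "TORONTO", 1),
        (3, "SMITH", "KING ST", "M4B 2C2", "TORONTO", 1),
        (4, "SMITH", "MAIN ST", "M5V 1A1", "TORONTO", 0),  # inactive party
        (5, "JONES", "ELM AVE", "K1A 0B1", "OTTAWA", 1),
    ],
)

source = {"last_name": "SMITH", "street": "MAIN ST",
          "postal_code": "M5V 1A1", "city": "TORONTO"}

# Each combination is one narrow search; only active parties qualify.
COMBINATIONS = [
    ("last_name", "postal_code"),
    ("street", "postal_code"),
    ("street", "city"),
]

def where_clause(columns):
    return " AND ".join(f"{c} = ?" for c in columns) + " AND active = 1"

def search_iteratively(source):
    """Run one query per combination and merge the results in application code."""
    candidates = set()
    for columns in COMBINATIONS:
        sql = f"SELECT id FROM party WHERE {where_clause(columns)}"
        rows = conn.execute(sql, [source[c] for c in columns])
        candidates.update(row[0] for row in rows)
    return candidates

def search_with_union(source):
    """Run a single statement that UNIONs every combination."""
    parts, params = [], []
    for columns in COMBINATIONS:
        parts.append(f"SELECT id FROM party WHERE {where_clause(columns)}")
        params.extend(source[c] for c in columns)
    return {row[0] for row in conn.execute(" UNION ".join(parts), params)}

iterative_pool = search_iteratively(source)
union_pool = search_with_union(source)
```

Both functions build the same candidate pool. The difference is the number of round trips to the database, which is why timing both variants against realistic data volumes is worthwhile.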


Matching

The matching process involves an assessment of the candidate suspect pool that determines how closely each party matches with the incoming, or source, party information. If there is a close enough match between two parties found by the search process, it will be allocated a possible duplicate score and suspect categorization - which is a product of the matching and non-matching values of their critical data elements.

InfoSphere MDM Server is flexible in that it allows for both deterministic and probabilistic matching (by using Quality Stage integration or another external matching engine) options.

Deterministic Matching

Deterministic matching involves comparing the set of values for all of a party’s critical data elements with those of another. This comparison takes into account the presence, absence, and content of the values and results in a score. Values that are present and match create a unique score that is referred to as the match relevancy portion whereas the values present that do not match create the non-match relevancy part of the score. Taken together, these scores define the type of suspect that has been found. When there are values missing in one or both parties for a particular critical data element, this element does not get included in the creation of the unique match and non-match relevancy scores.
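The scoring described above can be sketched as follows. This is an illustration only: the element names and weights are hypothetical, and the use of powers of two (so that each combination of elements sums to a distinct score) is an assumption made for the sketch, not a statement about the product's match rules.

```python
# Hypothetical critical data elements and weights. Powers of two are an
# assumption made here so that every combination sums to a distinct score.
WEIGHTS = {"last_name": 1, "given_name": 2, "birth_date": 4, "tax_id": 8, "address": 16}

def relevancy_scores(source, candidate):
    """Return (match_relevancy, non_match_relevancy) for two parties."""
    match = non_match = 0
    for element, weight in WEIGHTS.items():
        a, b = source.get(element), candidate.get(element)
        if a is None or b is None:
            continue  # missing on either side: excluded from both scores
        if a == b:
            match += weight
        else:
            non_match += weight
    return match, non_match

source = {"last_name": "SMITH", "given_name": "ANNA", "birth_date": "1980-02-01",
          "tax_id": None, "address": "1 MAIN ST"}
candidate = {"last_name": "SMITH", "given_name": "ANN", "birth_date": "1980-02-01",
             "tax_id": "123", "address": "1 MAIN ST"}

scores = relevancy_scores(source, candidate)
```

Here the last name, birth date, and address match, the given name differs, and the tax ID is ignored because it is missing on one side.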

InfoSphere MDM Server provides a comprehensive set of match rules and matrices that are a starting point for optimizing the match process. Generally, once you define what your critical data elements are, you should determine which combinations of these elements will create entries for each suspect categorization and adjust the match matrices accordingly. For example, if your business did not identify tax ID as a matching element, you will not want to include this data as part of your match process.

Tip:

Removing entries in the match matrices that reference unnecessary or unused critical data elements is good practice, but this activity can take time.

A quick approach to accomplishing the same results might involve modifying the match rules to set any unnecessary attributes to always return zero.

On the other hand, you may want to create new entries in the match matrices when you consider additional critical data elements not already considered out-of-the box with InfoSphere MDM Server. The mechanics of customizing the matching process is detailed in the InfoSphere MDM Server Developer’s Guide.


Tip:

Adding entries in the match matrices that take into account new critical data elements is good practice, but this activity can take time.

A quick approach that may accomplish the same result involves leveraging the suspect adjustment rules to give impact to specific non-critical data elements by upgrading or downgrading the suspect categorization.

Go through the match matrices and determine which combinations of critical data elements accurately represent each suspect categorization for your business.
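One way to reason about this review is to treat a match matrix as a lookup from score pairs to suspect categories. The sketch below is hypothetical: the scores assume illustrative weights (last name 1, given name 2, birth date 4, address 16), the entries are invented for the example, and treating unlisted combinations as C is an assumption of the sketch.

```python
# Hypothetical match matrix: (match relevancy, non-match relevancy) -> category.
MATCH_MATRIX = {
    (23, 0): "A1",  # every element present and matching
    (21, 2): "A2",  # only the given name differs
    (21, 0): "A2",  # given name missing, everything else matches
    (17, 6): "B",   # last name and address match; given name and birth date differ
}

def categorize(match_relevancy, non_match_relevancy):
    """Look up the suspect category; combinations without an entry are treated as C."""
    return MATCH_MATRIX.get((match_relevancy, non_match_relevancy), "C")

category = categorize(21, 2)
```

Walking through such a table entry by entry with the business makes it easier to confirm that each combination lands in the intended category.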

Probabilistic Matching

InfoSphere MDM Server can be configured to asynchronously refine and augment the deterministic match scoring of suspects for the highly probable duplicate (also known as an A2 match) and possible duplicate (also known as a B match) suspect categories with an additional probabilistic score. Probabilistic scores take into consideration the frequency of the occurrence of a data value within a particular distribution. For example, matching on the last name Smith in North America will probably yield a lower match score than matching on the last name DeFillipo. That is, a match on the last name Smith is less likely to be a true match than a match on DeFillipo, because Smith is a more common last name in North America. This approach can be used to improve the accuracy of these suspect categories using the Quality Stage probabilistic engine. Essentially, InfoSphere MDM Server passes the source party and its suspects to the engine where the matching is then done and the resulting score is stored with the identified suspects.
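The effect of value frequency can be illustrated with a generic frequency-based agreement weight. This sketch is not the Quality Stage algorithm. The frequencies are invented, and the log-based formula is simply one common way to make rarer values count for more.

```python
import math

# Hypothetical relative frequencies of last names in a reference population.
LAST_NAME_FREQUENCY = {"SMITH": 0.01, "DEFILLIPO": 0.00001}

def agreement_weight(value, frequencies, default_frequency=0.0001):
    """Weight an agreement by how rare the value is: rarer values score higher."""
    frequency = frequencies.get(value, default_frequency)
    return math.log2(1.0 / frequency)

smith_weight = agreement_weight("SMITH", LAST_NAME_FREQUENCY)
defillipo_weight = agreement_weight("DEFILLIPO", LAST_NAME_FREQUENCY)
```

With these assumed frequencies, agreement on DeFillipo contributes considerably more to the score than agreement on Smith.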

It is also possible to entirely replace InfoSphere MDM Server deterministic matching with probabilistic matching by calling the InfoSphere Quality Stage probabilistic engine from the matching rule. The candidate selection rule would also need to be customized in order to implement the blocking logic from the InfoSphere Quality Stage rule.

If this probabilistic matching approach is selected, consider the following:

• Quality Stage Probabilistic Matching runs on IBM Information Server and will increase the business solution complexity of your implementation and will add some overhead to the application performance.

• If you have additional party information (data extensions) or changes to critical data that are not captured in the InfoSphere Quality Stage probabilistic matching jobs provided with InfoSphere MDM Server, customizations will be required.


With either of these approaches, you will want to consider your approach to managing the suspects that do get identified. For example, if your matching rules generate many suspects, consider what approach you might need to take to deal with the volume. That is, do you have enough data stewards to manually analyze and resolve these possible matches? You might consider using stricter matching rules to keep suspects to a minimum and identify only the best or closest matches. If you find that these rules are too strict, you can always broaden them. When you relax the rules, you’ll want to use Evergreening, a method that allows for the identification of any previously unidentified suspects in the system. We cover more on Evergreening in the Data Load Strategies section of this document.

If you choose to create suspects that require resolution, make sure you have a strategy in place to resolve them – or consider not creating suspects. You can always relax your matching rules down the road. Working iteratively will enable you to refine your search and matching rules to ensure that the tolerance for missed matches and false matches is acceptable to the business.

Remember that no perfect matching system exists that will ensure complete correctness all of the time. As such, it is critical that you understand and define your business’ tolerance for missed matches as well as false matches. This involves testing iteratively the modifications being made to the critical data, scoring, and survivorship. It will require your data stewards to inspect, assess, and provide feedback on the results to ensure both accuracy and completeness of the data that has been processed. Engage your data stewards and business analysts in the testing and validation of changes made to critical data, match scoring, and survivorship processing.


Survivorship

The goal of implementing master data management in an enterprise is to provide the single version of the truth – in this case, of the customer. But what is that single version of the truth and how does one define it? As we’ve seen, defining a matching process takes care of identifying the duplicates and suspects. Survivorship, on the other hand, is the process that takes action on the suspect records found and, if required, can take care of the remediation of overlapping and sometimes inconsistent information.

The process of survivorship in InfoSphere MDM Server involves:

• The actions that will be taken on a party and its suspects if a guaranteed duplicate party has been identified by the matching process

• Using a set of well-defined criteria to determine what information will be carried forward to compose the physical or virtual single, master view of the party.

• (Optional) Providing notifications to other systems that a merge or suspect has been created in the MDM system.

When you configure your system to leverage the Suspect Duplicate Processing features of InfoSphere MDM Server, if a suspect, or suspected duplicate party, is identified for a new party being entered into the system, one of two actions may result:

1. The new party information may be immediately merged with the data already stored for the suspected duplicate

2. The new party can be persisted and the suspects may be resolved later by a data steward.

As part of these actions, the system may be configured to provide notifications of the event and provide the details required to action other processing extraneous to InfoSphere MDM Server if required.

If the course of action to be taken involves merging duplicate parties, the criteria defined to ensure only the most acceptable information is propagated and merged into the party are applied. These are what are termed the survivorship rules.

This section provides information about InfoSphere MDM Server’s survivorship functionality and provides guidance and recommendations for two approaches – the physical de-duplication, or merging of party information and conditional physical merging of party information.


Planning for Survivorship Processing

Consider the following when determining and further defining the approach to survivorship that best fits your business needs:

• To merge or not to merge?

o Determine whether you will physically de-duplicate all records or only a subset of them.

o Sometimes it may be desirable to keep everything and virtually merge to create a view of what the golden customer might look like, but not actually create a consolidated record. This is referred to as the Aggregate Party View feature of the Suspect Duplicate Processing of InfoSphere MDM Server.

• If merging – who or what will do the merging?

o Will you enable guaranteed duplicates to be processed by the system in real-time or asynchronously?

o Consider if and how you would like data stewards to process and resolve other probable matches. We discuss the suspect action types and some implementation considerations later in this section.

• If merging – how will the data get merged?

o Identify the set of criteria upon which customer data will be favored and retained during survivorship. We provide examples of criteria commonly considered later in this section.

Merging Guaranteed Duplicate Parties

Maintaining numerous silos that store similar customer information can be costly. With a drive towards smarter, greener business, merging customer information and reducing the number of systems required to source master data is a logical choice. InfoSphere MDM Server provides the MDM services and repository, becoming the system of record for a business’ customer master data. It also provides an audit trail of what has happened to the party or parties as they are merged and further updated in the system. It is a business decision to merge guaranteed duplicate parties found during the match process and there are a number of ways to merge them, both automatically and manually. Typically, businesses that choose to physically merge data end up doing some of each.

You should merge and consolidate guaranteed duplicate party information.


Other Suspect Action Categories

The bulk of this section focuses on the guaranteed matches, or suspects categorized as A1 matches. However there are other suspect types that are important to consider for the SDP implementation.

Match Category / Suspect Type Name    Description

A1    The parties are definitely the same.

A2    It is highly probable that the parties are the same. That is, these are really guaranteed matches, but need a data steward to review a particular piece of critical data to confirm.

B     The parties are possibly the same.

C     The parties are definitely not the same.

Actions Taken on Highly Probable Matches (The A2 Match)

Consider how you would like to process highly probable matches (A2 matches). In some cases you might simply want to add highly probable matches to the system and create a suspect entry between the parties.

At other times, you might want to provide a listing of existing probable matches and provide the user with an opportunity to select one from these. A2 matches could be considered guaranteed duplicates where one of the parties has an error in critical data. This processing is most useful when there is a user interface that provides data stewards with a means to actually select an A2 match and update it directly, rather than creating a new party in the system.

InfoSphere MDM Server provides you with the ability to configure the system to return any highly probable matches without taking any action on the suspects or adding the source party to the system immediately. In this case no suspect entries would be created.


Tip:

In the CONFIGELEMENT table, the /IBM/Party/SuspectProcessing/AddParty/returnSuspect configuration entry governs the ability to return the list of potential A2 suspects rather than adding the source party to the system. By default, this configuration is enabled.

To leverage the functionality, execute the addParty transaction, setting the <MandatorySearchDone> element of the Party business object to No.

If there are no guaranteed matches, a listing of all the highly probable matches (A2) will be returned and the party supplied as input to the addParty service will not have been added to the system.
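The behavior described in this tip can be summarized as a simple decision flow. The sketch below is a simplified model and not the product's addParty implementation. The function, its arguments, and the outcome labels are hypothetical, and `mandatory_search_done` stands in for the MandatorySearchDone element.

```python
def add_party(source, suspects, return_suspect_enabled=True):
    """Model the outcome of an add request.

    `suspects` maps an existing party id to its category ("A1", "A2", "B", "C").
    `return_suspect_enabled` models the returnSuspect configuration entry.
    """
    a1 = [pid for pid, cat in suspects.items() if cat == "A1"]
    a2 = [pid for pid, cat in suspects.items() if cat == "A2"]

    if a1:
        # Guaranteed duplicates are handled by the add party rule instead.
        return {"outcome": "guaranteed_duplicate", "parties": a1}
    if return_suspect_enabled and source["mandatory_search_done"] == "No" and a2:
        # Return the A2 list; the source party is not added and
        # no suspect entries are written.
        return {"outcome": "suspects_returned", "parties": a2}
    return {"outcome": "party_added", "parties": []}

result = add_party({"mandatory_search_done": "No"}, {10: "A2", 11: "B"})
```

A user interface can present the returned A2 list to the data steward, who then either updates one of the existing parties or resubmits the add request.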

In some implementations, it may be a requirement to treat suspects of more than one categorization in the same way. That is, the actions you have defined to occur as a result of finding suspects of these categories are the same.

For example, if the default A2 action is not useful for your SDP implementation, change the action rule for this type so that these suspects are processed in a fashion similar to B suspects, rather than eliminating the A2 categorization by, for example, changing these matches to B suspects. In this way you still maintain the categorization of the suspect types appropriately, so that a data steward can work with the closer matches first.

We recommend making changes to the suspect action processing rule rather than eliminating a suspect category by merging two categories of suspect types into one. Do not fall into the trap of simply re-categorizing the suspect types because the actions taken are similar or the same. Maintain the categorization of the suspect types appropriately so that a data steward can work with the suspects that have the closest match first.
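The recommendation can be pictured as a table that maps each category to an action while leaving the category itself untouched. The sketch below is hypothetical throughout: the function names and the action table are illustrative and are not the product's action rules.

```python
def add_and_create_suspects(source_id, suspect_id, category):
    return f"added {source_id}; suspect entry {source_id}-{suspect_id} ({category})"

def no_action(source_id, suspect_id, category):
    return "no suspect entry"

# Hypothetical action table. A2 reuses the B action but stays categorized as A2.
ACTION_BY_CATEGORY = {
    "A2": add_and_create_suspects,
    "B": add_and_create_suspects,
    "C": no_action,
}

CLOSENESS = {"A1": 0, "A2": 1, "B": 2, "C": 3}

def steward_worklist(suspects):
    """Order (suspect_id, category) pairs so the closest matches come first."""
    return sorted(suspects, key=lambda s: CLOSENESS[s[1]])

worklist = steward_worklist([(7, "B"), (9, "A2"), (4, "B")])
message = ACTION_BY_CATEGORY["A2"](100, 9, "A2")
```

Because the A2 label survives, the worklist can still place the A2 suspect ahead of the B suspects even though both categories trigger the same action.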

Actions Taken on Possible Matches (The B Match)

Typically when possible matches are found, the party is added to the system and suspect entries are created between these matching parties.

Actions Taken When There is a Failure to Match (The C Match)

Typically suspect records are not written for parties that are definitely not matches. The suspect action rule for this type can be modified in the event that you want to provide an action for matches of these types.

Persisting Guaranteed Duplicate Parties

Some businesses require that identified duplicate parties remain autonomous in the MDM system. When this is the case, it is important to clearly define the criteria that will determine whether a party is stored or merged with a pre-existing party that has been identified as a guaranteed duplicate of it.

It may be that you require the ability to store and maintain multiple profiles of the same party, by line of business, for example. As such, these party profiles should not be subject to the party merging functionality (suspect processing or collapse processing) available in InfoSphere MDM Server. You do, however, want to collapse duplicates within the line of business where the party profile resides. In this case, you want the suspect processing functionality of InfoSphere MDM Server to resolve duplicates within the line of business automatically and may want data stewards to assist in the remediation or collapsing of parties that are possible, but not guaranteed duplicates.

A second example of a requirement where you might not want to merge duplicates is the case where you have multiple autonomous agencies providing party information to the MDM repository and would like to provide a consolidated view of suspect duplicate parties, but it’s a case where “everybody’s copy of the party information is right” and other than agreeing on the terms of survivorship for a merged view of these duplicate parties, no physical merging or change to the originating copies is permitted, perhaps for security reasons. In this case you might want the suspect processing functionality of InfoSphere MDM Server to only identify duplicates and persist them and leverage the survivorship service provided by the aggregated view of these duplicate parties.

When duplicate parties are intentionally maintained in InfoSphere MDM Server, suspect records that get created between these parties are assigned a special suspect status type that indicates that they are not to participate in any collapse or merge activity:

6 – “Parties Suspect Duplicated – Collapse Not Permitted”

Tip:

Even if a suspect is marked to not be automatically processed by using suspect processing and merged with guaranteed duplicate parties, a data steward can still explicitly collapse parties between the lines of business if required.

Simply use the update suspect status service to update the suspect status type for the suspect record relating these parties from 6 to another suspect status type, such as 1 (Parties are Suspect Duplicates).

Next, use the collapse party service and explicitly provide the set of unique identifiers of parties to collapse.

Only store and maintain guaranteed duplicate parties when you are not permitted to automatically change party information across lines of business or contributing agencies, for example. The performance of the aggregate party view service will depend on how many items and how much data must be processed to create the resulting view.


Data Survivorship

Once you have made the decision to merge or not merge guaranteed duplicates, the next step is to ensure the survivorship rules reflect these decisions. In the next sections, the survivorship rules are described and concise recommendations are provided as to where you will want to make changes if survivorship processing enhancements are required to meet your particular business requirement, which is generally the case.

Suspect Action Rules

There are action rules that define the treatment of suspects found for a party by type.

One of the key rules that you need to know about is the “add party” rule of SDP. Specifically, the SuspectAddPartyRule determines what to do with guaranteed duplicates identified during the match process and suspect type adjustment process. The modifications you make to this rule define the criteria and specify the action that should be taken with the party – by either allowing the party to participate in an immediate merge with the suspect party, or by storing the party as is and marking it with a special suspect status type identifying it as a duplicate that should not be collapsed. More information on the default rule implementations to modify may be found in the InfoSphere MDM Server Developer’s Guide.

If you are persisting duplicates, this rule will additionally invoke a filtering rule which removes any suspects that should not be collapsed with the incoming party information. You must modify the BestFilteredSuspectsRule to include your criteria for including suspects in the filtered list. By default, for example, the BestFilteredSuspectsRule will pull only the best matching suspect, if there is one that shares the same LOB as the incoming party. The SuspectAddPartyRule then uses only these suspects, if any, to merge with an existing party or add a new one.
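The interplay of the two rules can be sketched as a filter followed by a decision. This is an illustrative model only and not the rule implementations shipped with the product. The record fields (`lob`, `match_score`) and the returned tuples are hypothetical. The status value 6 is the "Collapse Not Permitted" status type named in this document.

```python
SUSPECT_STATUS_COLLAPSE_NOT_PERMITTED = 6  # status type named in this document

def best_filtered_suspects(incoming, a1_suspects):
    """Keep only the best-matching A1 suspect that shares the incoming party's LOB."""
    same_lob = [s for s in a1_suspects if s["lob"] == incoming["lob"]]
    if not same_lob:
        return []
    return [max(same_lob, key=lambda s: s["match_score"])]

def suspect_add_party(incoming, a1_suspects):
    """Decide between merge, persist-as-duplicate, and plain add."""
    filtered = best_filtered_suspects(incoming, a1_suspects)
    if filtered:
        return ("merge", filtered[0]["id"], None)
    if a1_suspects:
        # Duplicates exist only in other LOBs: store the party and flag the suspects.
        return ("persist_duplicate", None, SUSPECT_STATUS_COLLAPSE_NOT_PERMITTED)
    return ("add_new", None, None)

incoming = {"id": 100, "lob": "RETAIL"}
suspects = [
    {"id": 1, "lob": "RETAIL", "match_score": 90},
    {"id": 2, "lob": "RETAIL", "match_score": 95},
    {"id": 3, "lob": "WEALTH", "match_score": 99},
]
decision = suspect_add_party(incoming, suspects)
```

Party 3 has the highest score but belongs to a different line of business, so the incoming party merges with party 2 instead.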

We recommend that you clearly define the factors, or criteria, that differentiate between the case to automatically merge (update), persist the duplicate, or add a new party to the system.

The A1SuspectsActionRule determines when and how to write suspect records for incoming parties identified as potential duplicates. Rules also exist for the other suspect types: A2 (high probability of being a duplicate), B (might be a duplicate) and C (not duplicate) suspects. It is from these rules that notifications are generated to let other systems know that suspects have been created for an existing party in InfoSphere MDM Server.

Consider the use of notifications if you need to inform other systems of suspect identification.

Suspect Adjustment Rules

Once the matching process has been completed there is an opportunity to override and adjust the assigned suspect type as well as the suspect status type if an upgrade or downgrade is required.


For example, there might be specific instances where the values in particular non-critical data elements matter and affect the type of match made by the system. Consider the case where two parties match as highly probable duplicates (A2), but because both parties are female and differ only on the last name critical data element, you might assume that one of these names is a maiden name and upgrade the suspect to a guaranteed duplicate (A1).

The suspect type adjustment rule (PartyMatchCategoryExtRule) as well as the suspect status adjustment rule (AdjustSuspectStatusRule) will need to reflect the changes that have been made to the match category as well as the suspect status type.

In addition, in the near real-time mode of operation, an asynchronous event may be triggered to augment the scores and upgrade or downgrade the categorization of an A2 or B suspect by using the Quality Stage (QS) probabilistic matching engine. The section on matching covers this particular option in more detail.

If there is a specific case where a non-critical data element impacts the categorization of a suspect, use the adjustment rule to upgrade or downgrade the party’s suspect type.

Any upgrade or downgrade in suspect type should be accompanied by an appropriate change in the suspect status type.

In the event that a suspect is upgraded to or downgraded from a guaranteed duplicate, make sure to also adjust the suspect status type appropriately if the criteria for persisting duplicates are different from what is offered by default in the product.
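The maiden name example can be sketched as an adjustment function that changes the category and keeps the status in step with it. The function, the `gender` field, and the status labels are hypothetical. The real suspect status types are defined by the product.

```python
# Hypothetical status labels; real suspect status types are defined by the product.
STATUS_BY_CATEGORY = {
    "A1": "ELIGIBLE_FOR_COLLAPSE",
    "A2": "NEEDS_REVIEW",
    "B": "NEEDS_REVIEW",
}

def adjust_suspect(source, candidate, category, differing_elements):
    """Upgrade an A2 suspect to A1 when a non-critical element explains the difference."""
    both_female = source.get("gender") == "F" and candidate.get("gender") == "F"
    if category == "A2" and both_female and differing_elements == {"last_name"}:
        category = "A1"  # assume one last name is a maiden name
    # Keep the suspect status consistent with the (possibly changed) category.
    return category, STATUS_BY_CATEGORY[category]

upgraded = adjust_suspect({"gender": "F"}, {"gender": "F"}, "A2", {"last_name"})
unchanged = adjust_suspect({"gender": "F"}, {"gender": "M"}, "A2", {"last_name"})
```

Returning the category and the status together makes it harder to change one without the other.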

Manually Marking Parties as Suspects

There are services to flag parties as suspects of each other as well. These are the markPartiesAsSuspect and unmarkPartiesAsSuspect services in InfoSphere MDM Server. The markPartiesAsSuspect service creates a suspect entry between two parties that have not yet been identified as suspects by InfoSphere MDM Server matching.

Tip:

The markPartiesAsSuspect service provides useful information about how closely two parties actually match (or not) by providing the match and non-match relevancy scores. These may be used to determine why the parties are not perfect matches based on the critical data elements that they share (match relevancy) or do not share (non-match relevancy).

Note: Use of the service will not show the results of the asynchronous probabilistic augmentation of the deterministic match process.

Both unmarkPartiesAsSuspect and updateSuspectStatus services provide the ability to indicate that parties are not suspects of each other in the event that they have been marked as suspects in error.


Merging Party Information (Add/Update and Collapse)

When it has been decided that a party should and will be merged with existing party information, important decisions need to be made about what information should be carried forward to produce the final result. You will need to identify the criteria upon which the data will be favored and retained in the remaining updated or collapsed record. How you choose these criteria will depend on your business needs.

Typically you will add information to the record that was not already there and update to enhance existing information if “better” or more information is provided by one of the parties being merged.

So what factors might influence the decision that information is, in fact, “better”? You might consider as criteria:

• source system factors

o accuracy of data

Depending on how data is collected, the accuracy of the data may be impacted. For example, a phone number added or updated by a call centre is likely to be more accurate than one collected on a website.

o Some information may not be accessible to merge into the consolidated record, in order to comply with certain standards or for legal reasons.

• data recency

o How recently was the data added, updated, or verified? Do you want to propagate a piece of information from a system that was last updated 5 years ago, or from a system where it was updated last week? If implementing your MDM solution in a co-existence strategy (some data continues to be maintained in the source system, while other data is maintained in InfoSphere MDM Server), you might want to consider the date that the information was last updated in its source system.

Whatever you decide to use to differentiate “good” from “better” information, you’ll want to use the same criteria for all of your survivorship processing. That is, all rules related to survivorship processing should reuse the same logic.
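One way to keep the logic consistent is to put the "better" test in a single function that every merge path calls. The sketch below is a minimal illustration. The source ranking, the record layout, and the choice to compare source trust before recency are all assumptions, not product behavior.

```python
from datetime import date

# Hypothetical source ranking: a lower number means a more trusted source.
SOURCE_RANK = {"CALL_CENTRE": 0, "BRANCH": 1, "WEBSITE": 2}

def is_better(candidate, current):
    """Single shared test for whether `candidate` should survive over `current`."""
    if current is None or current["value"] is None:
        return candidate["value"] is not None
    if candidate["value"] is None:
        return False
    # Compare source trust first, then recency (newer wins).
    cand_key = (SOURCE_RANK[candidate["source"]], -candidate["last_updated"].toordinal())
    curr_key = (SOURCE_RANK[current["source"]], -current["last_updated"].toordinal())
    return cand_key < curr_key

def merge_records(surviving, other):
    """Carry forward the better value for every attribute found in either record."""
    merged = dict(surviving)
    for attribute, candidate in other.items():
        if is_better(candidate, merged.get(attribute)):
            merged[attribute] = candidate
    return merged

surviving = {"phone": {"value": "555-0100", "source": "WEBSITE",
                       "last_updated": date(2024, 5, 1)}}
other = {"phone": {"value": "555-0199", "source": "CALL_CENTRE",
                   "last_updated": date(2020, 1, 1)},
         "email": {"value": "a@example.com", "source": "WEBSITE",
                   "last_updated": date(2023, 3, 3)}}
merged = merge_records(surviving, other)
```

In this example the call centre phone number survives even though it is older, because the assumed ranking trusts that source more. A business that values recency over source would swap the order of the comparison key.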

Merging can happen automatically or as the result of an explicit call or action to merge – a request typically spawned by a data steward or as part of the Evergreening process.

Unlike automatic merging of incoming information with found guaranteed duplicates already existing in the system, the collapsing of parties involves taking explicit action to merge two parties. For example, this may be based on additional information that is known only to the data steward resolving the duplicate.


Regardless of what spawned merge or collapse processing, keep the survivorship logic the same. That is, any enhancements being made to the collapse and guaranteed duplicate update survivorship logic should be applied consistently to all rules.

Synchronizing With Source Systems

If you are merging guaranteed duplicate parties, the physically persisted consolidated record becomes the “golden record” for a party. The information in this record may need to be fed back to the source systems or published to downstream systems. The need to synchronize the “golden record” back to source systems will really depend on the implementation style of your MDM solution.

It is typical that the early phases of an MDM solution are implemented in either a consolidation or co-existence style.

In the consolidation style of implementation, party data is updated only in the source system. InfoSphere MDM Server is a system of reference for the consolidated data and is typically used to feed downstream systems for analytical purposes or to serve as a read-only source of trusted data for other operational systems. With this implementation style, the “golden record” is not fed back to source systems. However, it is important to make sure that source systems employ a proper delta sensing framework so that only newly updated data is provided to InfoSphere MDM Server as part of the delta load.
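Delta sensing can be implemented in many ways. One simple approach, sketched below, is to compare content fingerprints between extracts. The record layout and function names are hypothetical, and the sketch covers only new and changed records, not deletions.

```python
import hashlib
import json

def fingerprint(record):
    """Stable hash of a record's content (keys sorted so ordering does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def sense_delta(previous_fingerprints, current_records):
    """Return ids of records that are new or changed since the last extract."""
    changed = []
    for record_id, record in current_records.items():
        if previous_fingerprints.get(record_id) != fingerprint(record):
            changed.append(record_id)
    return changed

last_extract = {
    "P1": {"last_name": "SMITH", "city": "TORONTO"},
    "P2": {"last_name": "JONES", "city": "OTTAWA"},
}
previous = {rid: fingerprint(rec) for rid, rec in last_extract.items()}

current_extract = {
    "P1": {"city": "TORONTO", "last_name": "SMITH"},   # unchanged, keys reordered
    "P2": {"last_name": "JONES", "city": "KINGSTON"},  # changed
    "P3": {"last_name": "WOOLF", "city": "TORONTO"},   # new
}
delta = sense_delta(previous, current_extract)
```

Only the changed and new records would be passed to the delta load.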

With the coexistence style of implementation, the party data is updated in both the source systems and in InfoSphere MDM Server. InfoSphere MDM Server in this case is not the system of record because it is not the single place where master data is authored. As such, synchronization of the party data with the source systems is recommended. This data synchronization should be bi-directional and special care should be taken to avoid situations in which updates from the source system conflict with updates from InfoSphere MDM Server.

With the centralized implementation style, InfoSphere MDM Server is the system of record and the only place where master data gets authored. Any updates to the “golden record” should be distributed back to source systems.

The Aggregate View of the Party

When you automatically merge parties, the golden record is always available through the related inquiry services of InfoSphere MDM Server. On the other hand, when you do not merge all duplicate parties into a single record, you might want to see what it would have looked like if the full, physical consolidation had taken place. For this reason, InfoSphere MDM Server provides a service that allows for an aggregate view of the party to do just that.

Whether one of these scenarios applies to your situation or you have another reason to persist duplicate parties, it is important to remember to minimize the number of candidate duplicate parties to be merged for viewing at one time. The more candidates there are involved in survivorship processing, the longer it will take to generate the single aggregated view. For further information on the aggregated party view service itself, see the InfoSphere MDM Server Transaction Reference Guide.


Minimize as much as possible the size of the guaranteed duplicate party pool that is required to produce the aggregate view of the party. The performance of the aggregate party view service will depend on how many items and how much data must be processed to create the resulting view.
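The reason pool size matters can be seen in a simple model of a virtual merge. The sketch below is not the aggregate party view service. The record layout is hypothetical, and "most recently updated value wins" is an assumed survivorship rule.

```python
def aggregate_view(parties):
    """Build a virtual golden record; the persisted parties are not modified."""
    view = {}
    # Oldest first, so that more recently updated parties overwrite older values.
    for party in sorted(parties, key=lambda p: p["last_updated"]):
        for attribute, value in party["attributes"].items():
            if value is not None:
                view[attribute] = value
    return view

duplicates = [
    {"id": 1, "last_updated": "2023-01-10",
     "attributes": {"last_name": "SMITH", "phone": "555-0100", "email": None}},
    {"id": 2, "last_updated": "2024-06-02",
     "attributes": {"last_name": "SMITH", "phone": None, "email": "a@example.com"}},
]
view = aggregate_view(duplicates)
```

Every party in the pool and every attribute it carries is visited each time the view is requested, so the work grows with both the number of duplicates and the amount of data held for each.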


Data Load Strategies

The strategy you take for loading data is an important aspect of SDP planning. InfoSphere MDM Server maintains a complete and accurate view of master data by consolidating data from existing systems into a physical master repository. Commonly, data from the relevant source systems are batch loaded into the InfoSphere MDM Server repository during the first phase of an InfoSphere MDM Server implementation. We will refer to this process as the “initial data load”. Sometimes a “trickle-feed” approach to the initial load process is taken. In this case, the data is first sent to InfoSphere MDM Server to be consolidated when it has been updated in the source systems.

After the initial data load has been done, if any master data gets changed in the source system, it must be synchronized with InfoSphere MDM Server. Depending on your business requirements, the synchronization technology you employ can be different. For example, the near-real-time synchronization requirement is often implemented as frequent delta batch loads. The real-time synchronization requirement, on the other hand, is typically implemented as direct SOA service calls to InfoSphere MDM Server.

The choices you make - for an initial data load strategy and a subsequent data synchronization approach - are generally driven by your non-functional requirements. These include: the amount of data to be loaded, the latency allowances, the transactional requirements, the time allocated for the load process, and the technical capabilities of the source systems in place.

You should disable SDP during the initial data load process. Perform suspect duplicate processing as part of the Evergreening process after the initial set of data has been loaded into InfoSphere MDM Server. By decoupling the SDP logic from the load process you achieve faster load times.

The Evergreen application within InfoSphere MDM Server provides the capacity to identify potential duplicate parties that might have been introduced into the database and the ability to collapse parties without human intervention, using rules to support data survival.

Typical Recommended Approaches to Loading Data

This section describes the recommended and typical approaches to loading data, the pros and cons of each and impact on the SDP process.

Conducting a batch load most often involves the InfoSphere MDM Server batch processor component. The batch processor serves to invoke certain InfoSphere MDM Server services that were designed specifically to handle the loading of data. These services are called the “maintenance services” and are part of the InfoSphere MDM Server product.


Alternatively, loading data could be implemented as an extract, transform, load (ETL) job that writes data directly into the InfoSphere MDM Server repository. This is sometimes referred to as the “direct data load” approach. InfoSphere MDM Server provides a direct data load solution as part of the Rapid Deployment Package (RDP). The direct data load process for this offering has been implemented as InfoSphere DataStage jobs and InfoSphere QualityStage jobs.

Use of batch load via services for both the initial and delta loads

Use the maintenance runtime services for both initial data loads and delta loads into InfoSphere MDM Server, with SDP turned OFF during initial data load. This approach provides not only good performance, but ease of implementation and maintenance. This approach will require Evergreening for suspect duplicate processing after the initial load and prior to delta load.

This approach has no dependency on the RDP assets. All of the audit and traceability functionality for information being added to the InfoSphere MDM Server system remains available and intact using this approach. This means that if you have source data lineage requirements (the ability to retrieve the source data from InfoSphere MDM Server in exactly the same form as it existed in the source system), this approach satisfies them. For example, you may want to know that certain parties were updated during the load process as a result of an A1 merge; this information is available in the history database if history triggers are set. Also note that any required data additions and extensions, or changes to suspect processing or standardization rules, need to be implemented only once, for the InfoSphere MDM Server runtime.

Enabling SDP during initial load will impact performance and should be considered only if load timelines allow for it.
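To judge whether load timelines allow for a given configuration, a back-of-the-envelope estimate is often enough as a first check. The sketch below is a minimal example of that arithmetic; the record count, the throughput figures, and the load window are hypothetical values chosen for illustration, and you would substitute throughput measured in your own environment.

```python
def estimated_load_hours(record_count: int, records_per_second: float) -> float:
    """Return the estimated duration of a load, in hours."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return record_count / records_per_second / 3600.0


def fits_load_window(record_count: int, records_per_second: float,
                     window_hours: float) -> bool:
    """Return True when the estimated load time fits the allocated window."""
    return estimated_load_hours(record_count, records_per_second) <= window_hours


# Hypothetical figures: measure real throughput in your own environment.
RECORDS = 36_000_000
WINDOW_HOURS = 24.0
for label, rate in (("SDP OFF", 500.0), ("SDP ON", 200.0)):
    hours = estimated_load_hours(RECORDS, rate)
    fits = fits_load_window(RECORDS, rate, WINDOW_HOURS)
    verdict = "fits" if fits else "does not fit"
    print(f"{label}: {hours:.1f} h, {verdict} in a {WINDOW_HOURS:.0f} h window")
```

With these made-up figures the load takes 20 hours with SDP OFF and 50 hours with SDP ON, so only the SDP OFF configuration fits a 24-hour window.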

Use of direct data load (with SDP OFF) for the initial data load followed by batch load by using the maintenance services for the delta loads

When looking for a faster initial data load, consider using the InfoSphere DataStage jobs provided as part of the InfoSphere MDM Server RDP and loading your data directly into the InfoSphere MDM Server database. Keep SDP turned OFF during the initial data load. Choose this solution only if the batch loads using the services cannot finish the initial load within the time constraints that you have (that is, you have a large data volume to be loaded in a very short time). Note that any data extensions or data additions that have been developed for InfoSphere MDM Server entities must be implemented both for the InfoSphere MDM Server runtime and in the InfoSphere DataStage jobs responsible for loading the data.

This approach will also require running Evergreening for suspect duplicate processing after the initial load and prior to conducting delta loads. When executing a delta load using the maintenance services of InfoSphere MDM Server, SDP should be turned ON.


It is important to ensure that the following aspects of SDP are identical between InfoSphere MDM Server runtime and direct load when this load approach is taken:

o The InfoSphere MDM Server runtime must be configured to invoke IBM InfoSphere Information Server QualityStage standardization. The standardization QualityStage job invoked by the InfoSphere MDM Server runtime must be updated to use the same standardization rule set as the QualityStage job that is used for the direct load and is part of the RDP assets.

o The default configuration for the InfoSphere MDM Server phonetic algorithm must be replaced by the empty implementation. Phonetic keys should be generated by the QualityStage standardization job.

Data for additional source systems being loaded for the first time should be added using the maintenance services.

Use of direct data load (with SDP ON) for the initial load followed by batch load using the maintenance services for the delta loads

The InfoSphere DataStage jobs and QualityStage jobs from the InfoSphere MDM Server RDP could be used for loading data directly into InfoSphere MDM Server database with SDP logic being executed during the initial data load. With this combination of load strategies and configurations the initial load might be slower than direct load with SDP OFF.

The decision to enable SDP as part of direct initial data load has several implications:

• Source data lineage requirements

The decision to perform SDP as part of the initial data load has an impact on source data lineage. Source data lineage is the ability to retrieve the source data from InfoSphere MDM Server in exactly the same form as it existed in the source system.

Typically, when a new party is being added to InfoSphere MDM Server and it is identified as an exact duplicate of an existing record, the add action becomes an update and InfoSphere MDM Server retains only one record of the party. Changes to the party can be traced using history tables. With an alternative SDP configuration, InfoSphere MDM Server could simply store the incoming duplicate record and mark it as a duplicate, leaving the collapse step to the Data Steward or to the Evergreening process. This configuration provides the best source data lineage.

When performing a direct initial load to InfoSphere MDM Server using the InfoSphere DataStage jobs and InfoSphere QualityStage jobs for RDP with suspect duplicate processing enabled, all duplicate parties (A1 matches) are merged prior to the actual load and only the final surviving party data is loaded into InfoSphere MDM Server. The original party records are not loaded, and hence no source data lineage is available in InfoSphere MDM Server.


When using the maintenance services for the delta loads, the normal audit trail of party records entering InfoSphere MDM Server will be available. At this point, any updates to a party as a result of an A1 merge will be made available on the history database if history triggers are set.

• SDP customization efforts

If the direct data load approach is taken for the initial load and suspect processing is done as part of the load process, you must consider the development cost of customizing SDP in both direct load and InfoSphere MDM Server runtime. Changes to suspect duplicate processing or standardization rules and formatting will need to be done twice, once for the load jobs and once for InfoSphere MDM Server runtime.

In the initial implementation phase, InfoSphere MDM Server may not have consuming channels sending data to it in real time by using SOA services. However, the InfoSphere MDM Server Data Stewardship User Interface (DSUI) uses SOA services to manage suspect duplicate parties and, therefore, the InfoSphere MDM Server runtime services must have the same SDP and standardization behavior as the direct load jobs.

It is crucial to ensure that the following aspects of SDP are identical between InfoSphere MDM Server runtime and direct load:

o Critical Data Elements - The definition of critical data elements must be the same in InfoSphere MDM Server runtime and in direct load InfoSphere DataStage jobs.

o Candidate Selection Process - The suspect search rule being picked up for use by the InfoSphere MDM Server runtime must have the same logic as the blocking algorithm used by the InfoSphere DataStage jobs and InfoSphere QualityStage jobs.

o Matching and Suspect Categorization Process - The InfoSphere MDM Server runtime must be configured to invoke InfoSphere QualityStage matching. The matching InfoSphere QualityStage job invoked by the runtime must be updated to use the same matching rule set as the InfoSphere QualityStage job used as part of the direct load.

o Survivorship Rules - The business logic for surviving data from duplicate parties during auto-collapse must be the same between InfoSphere MDM Server runtime and the direct load jobs. If you choose to persist duplicates and prevent the automatic collapse or merge of parties based on some criteria (for example, LOB), the criteria you use to make the decision to persist the parties must be implemented in both the InfoSphere MDM Server runtime rules and the direct data load jobs in an identical fashion.

o Standardization and Phonetics - The InfoSphere MDM Server runtime must be configured to invoke InfoSphere QualityStage standardization. The standardization InfoSphere QualityStage job invoked by the runtime must be updated to use the same standardization rule set as the InfoSphere QualityStage job that is used as part of the direct load. In addition, the InfoSphere MDM Server phonetic key generation algorithm must be replaced by the empty implementation and phonetic keys should be generated by the InfoSphere QualityStage standardization job.

In addition to SDP configuration efforts, any data extensions or data additions developed for InfoSphere MDM Server entities must be implemented in both InfoSphere MDM Server runtime and in the InfoSphere DataStage jobs responsible for loading the data.
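One way to keep the two implementations from drifting apart is to record the SDP-relevant settings of each in a comparable form and check them for parity before every load. The following Python sketch illustrates the idea only; the aspect names and the dictionary layout are assumptions for this example and do not correspond to any product configuration format.

```python
from typing import Dict, Tuple

# Aspects that the text above says must match between the runtime and
# the direct load jobs. The key names are hypothetical.
SDP_ASPECTS = (
    "critical_data_elements",
    "candidate_selection",
    "matching_rule_set",
    "survivorship_rules",
    "standardization_rule_set",
    "phonetic_key_source",
)


def find_sdp_mismatches(runtime: Dict[str, object],
                        direct_load: Dict[str, object]) -> Dict[str, Tuple]:
    """Return each aspect whose recorded value differs, with both values."""
    mismatches = {}
    for aspect in SDP_ASPECTS:
        runtime_value = runtime.get(aspect)
        load_value = direct_load.get(aspect)
        if runtime_value != load_value:
            mismatches[aspect] = (runtime_value, load_value)
    return mismatches
```

An empty result means the recorded settings agree; any entry in the result names an aspect that has been customized on one side only.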

Adding New External System After Initial Load

New source systems commonly need to be loaded into the InfoSphere MDM Server solution in later implementation phases. This activity requires special planning from the SDP point of view. You should not treat any new source system data as a routine delta load.

We recommend repeating the data quality analysis phase on the new source system data to ensure that the existing standardization and matching rules are still applicable after the introduction of the new set of data into InfoSphere MDM Server.

You might have to adjust the candidate selection, matching and data survivorship rules. Failure to adjust the candidate selection rules could result in performance degradation of SDP logic due to an unexpectedly large candidate selection pool, or in missed candidates. Failure to adjust the matching and survivorship rules could result in missed matches or incorrectly auto-collapsed parties.
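A simple profiling step can reveal whether a new source system would inflate the candidate selection pool. The sketch below counts how many parties share each candidate selection key and reports the keys that exceed a threshold. The key used here (the first three letters of the last name plus the postal code) and the field names are hypothetical; substitute the logic of your own suspect search rule.

```python
from collections import Counter
from typing import Dict, Iterable


def blocking_key(party: Dict[str, str]) -> str:
    # Hypothetical candidate selection key, not a product default.
    return party["last_name"][:3].upper() + "|" + party["postal_code"]


def oversized_blocks(parties: Iterable[Dict[str, str]],
                     max_block_size: int) -> Dict[str, int]:
    """Return the keys shared by more than max_block_size parties."""
    counts = Counter(blocking_key(p) for p in parties)
    return {key: n for key, n in counts.items() if n > max_block_size}
```

A typical finding is a source system that fills missing values with placeholders (for example, a postal code of 00000), which places many unrelated parties under one key and enlarges the pool that matching must process.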

Data for additional source systems being loaded for the first time should be added by using the maintenance services.

The InfoSphere DataStage jobs and InfoSphere QualityStage jobs for RDP (initial load option) can be used to load data, but taking this approach results in no check being made for potential duplicate parties or addresses that already exist in the InfoSphere MDM Server repository. As such, Evergreening must be executed for each of the additional systems to identify duplicate parties. The Evergreening process will not ensure the removal of duplicate addresses in the address table. Party Addresses, or the addresses as they are assigned to a party, will be de-duplicated.


Best Practices

• Perform data discovery and analysis as a first step for your Master Data Management project.

• Select your critical data elements based on the data discovery and analysis results and continuously monitor the quality of the data being provided to the system to ensure critical data element selection remains valid.

• Standardize free form data designated as critical data elements, such as name, address, phone number, and identifier.

• Choose standardization and phonetics components suitable for your geographical region.

• The party search and matching rules should employ the same critical data elements that have been selected as a result of your data profiling activities.

• Work iteratively to refine your search and matching rules to ensure the tolerance for missed matches and false matches is acceptable to the business.

• Optimize your database indexes when you modify your search criteria to improve the performance of your search process.

• Use the Smart Inquiries feature to configure and avoid making unnecessary queries to empty or unused parts of the data model.

• Go through the match matrices and determine which combinations of critical data elements accurately represent each suspect categorization for your business.

• Consider if and how you will manage the volume of suspects identified by the matching process.



• Merge and consolidate guaranteed duplicate party information whenever possible.

• Make changes to a suspect action processing rule rather than eliminating an entire suspect category by merging two categories of suspect types into one.

• Do not fall into the trap of simply re-categorizing the suspect types because the actions taken are similar or the same. Maintain the categorization of the suspect types appropriately so that a data steward can work with the suspects that have the closest match first.

• If you must maintain duplicate parties, clearly define the criteria that determine whether to automatically merge (update) the party or to maintain the duplicate as a new party in the system.

• (Deterministic matching only) When the value of a non-critical data element can impact the categorization of a suspect, use the suspect adjustment rules to upgrade or downgrade the party’s suspect type and suspect status type.

• Keep the criteria for your survivorship rules consistent if multiple methods of survivorship are applied in the same solution (collapse and aggregating party logic).

• Ensure that consistent logic (critical data elements, candidate selection process, matching and suspect categorization process, survivorship rules, standardization and phonetics) is used if multiple data feed approaches are used as part of the same solution.

• Minimize as much as possible the size of the guaranteed duplicate party pool required to produce an aggregate view of the party, which is used when persisting duplicates. The performance of the aggregate party view service will depend on how many items and how much data must be processed to create the resulting view.

• Load data into InfoSphere MDM Server by using the maintenance runtime services for both initial and delta loads, with SDP turned OFF during the initial data load.

• Repeat the data discovery and quality analysis cycle and adjust the SDP logic when adding a new source system to an existing InfoSphere MDM Server solution.

• Data for any additional source systems being loaded for the first time should be added by using the InfoSphere MDM Server maintenance services.


Conclusion

Master Data Management is directed to organizations that recognize the value of maintaining the quality of their master data. By incorporating the InfoSphere MDM Server suspect duplicate processing (SDP) feature into your master data management solution, you can significantly improve the quality and streamline the maintenance of your customer information. In conjunction with data governance strategies, it is not difficult to realize the significant cost savings of cleansing, de-duplicating, and consolidating this master data. However, for the best results, it is important to follow the best practices provided in this document that are most pertinent to your InfoSphere MDM Server solution.


Further reading

Other documentation with information on related topics:

• InfoSphere Master Data Management Server Information Center http://publib.boulder.ibm.com/infocenter/mdm/v9r0/index.jsp

• InfoSphere Master Data Management Server Developer’s Guide, Version 9.x

• InfoSphere Master Data Management Server Version 9.x, Developer’s Samples Guide, "Working with Party Maintenance Services"

• Enterprise Master Data Management: An SOA Approach to Managing Core Information (2008). IBM Press. Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., vanRun, P., Wolfson, D.

• InfoSphere Foundation Tools – The Entry Point to your Information Agenda http://www-01.ibm.com/software/data/information-agenda/foundation-tools.html

• IBM WebSphere Information Analyzer and Data Quality Assessment http://www.redbooks.ibm.com/abstracts/sg247508.html

• Loading a large volume of Master Data Management data quickly, Part 1: Using RDP InfoSphere MDM Server maintenance services batch http://www.ibm.com/developerworks/data/library/techarticle/dm-0908mdmdataload/

• Loading a large volume of Master Data Management data quickly, Part 2: Using Rapid Deployment Package direct load with InfoSphere MDM Server http://www.ibm.com/developerworks/data/library/techarticle/dm-1007mdmdataload2/index.html

• Master Data Management: Rapid Deployment Package for MDM http://www.redbooks.ibm.com/abstracts/sg247704.html?Open



Contributors

The authors would like to recognize the following individuals for their feedback on this paper and their contributions to this suspect duplicate processing topic:

David Borean

Lead Product Architect, InfoSphere MDM Server

Karen Chouinard

InfoSphere Servers – Competency Center, Operations, Process Reengineering

Christine Davis

Business Architect, InfoSphere MDM Server Lab Services

Gordon Gifford

Senior Technical Specialist, InfoSphere MDM Server Lab Services

Adam Muise

InfoSphere Worldwide Sales

Linda Park

Manager of Business Requirements, InfoSphere MDM Server

John Thomas

WW MDM Server Competency Manager

Paul van Run

Senior Technical Staff Member, Chief MDM Architect

Lucy Xia

L2 Support Analyst, InfoSphere MDM Server

Michelle Corbin

Information Architect, ID Team Lead, InfoSphere Information Server


Appendix A: Configuration Tips for the Data Standardization Feature

Data standardization is a time-consuming process. Because there are a number of ways to configure this very flexible process in InfoSphere MDM Server, we have seen cases where it has been configured to run more than once within a single add/update/SearchParty service. Repeated standardization within a service call is not typically required.

In InfoSphere MDM Server version 9.0 and higher, enable (that is, set the value column to true) the following attributes in the configuration and management component:

/IBM/ThirdPartyAdapters/IIS/StandardizeAddress/StandardFormattingIndicator/enabled
/IBM/ThirdPartyAdapters/IIS/StandardizePhoneNumber/StandardFormattingIndicator/enabled

Enable the StandardFormattingIndicator settings in the InfoSphere MDM Server configuration and management component to run standardization once during impacted services.
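The general idea of running standardization once can be pictured as a guard on an "already standardized" marker. The Python sketch below is only an analogy and does not describe the product internals: the standard_formatting_indicator field, the value "Y", and the CountingStandardizer class are assumptions made for this example.

```python
from typing import Dict


class CountingStandardizer:
    """Hypothetical stand-in for an expensive standardization call."""

    def __init__(self) -> None:
        self.calls = 0

    def standardize(self, address: Dict[str, str]) -> Dict[str, str]:
        self.calls += 1
        # Toy standardization: upper-case and collapse whitespace.
        cleaned = " ".join(address["line_one"].upper().split())
        return {**address,
                "line_one": cleaned,
                "standard_formatting_indicator": "Y"}


def standardize_once(address: Dict[str, str],
                     standardizer: CountingStandardizer) -> Dict[str, str]:
    # Skip the expensive call when the record is already marked.
    if address.get("standard_formatting_indicator") == "Y":
        return address
    return standardizer.standardize(address)
```

Calling standardize_once twice on the same address invokes the standardizer only the first time; the second call returns the already standardized record unchanged.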


Appendix B: Quick Steps for Customizing the Criteria for Persisting Duplicates

Customizing the set of rules involved in making the decision to persist duplicates involves:

• Ensuring that during the initial matching process, only the suspects sharing the same line of business get matched and merged. Filter out anything that does not meet this criterion.

• Ensuring that during the suspect adjustment process, if a suspect that was originally only a probable match is upgraded to a guaranteed match, it is not involved in the merge if the line of business is different.
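The two checks above can be sketched as a pair of small functions. This is an illustration of the decision logic only, not the code of the BestFilteredSuspectsRule or AdjustSuspectStatusRule external rules; the dictionary fields lob and category are hypothetical, and "A1" is used for a guaranteed match as elsewhere in this document.

```python
from typing import Dict, List


def filter_suspects_by_lob(source_lob: str,
                           suspects: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the suspects that share the source party's line of business."""
    return [s for s in suspects if s["lob"] == source_lob]


def may_merge(source_lob: str, suspect: Dict[str, str]) -> bool:
    """Allow a merge only for a guaranteed match in the same line of business.

    This also covers a suspect that was upgraded to a guaranteed match
    during suspect adjustment: a different line of business still blocks
    the merge, so the duplicate is persisted instead.
    """
    return suspect["category"] == "A1" and suspect["lob"] == source_lob
```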

Tip:

Use the default set of rules for this feature to first see the persistence of duplicates in action:

1. Add a party with identical data to the system twice, only providing different values for the RelatedLobType or RelatedLobValue fields of the TCRMPartyLobRelationshipBObj. (The CDLOBTP provides line of business type codes and values to choose from.)

2. Query the CONTACT table and notice the same party data gets persisted twice.

Important: Do not confuse RelatedLobType/Value fields with the LobRelationshipType/Value fields. The LobRelationshipType/Value fields denote the role that a party plays within the line of business, and not the line of business type itself.


What follows is a checklist for ensuring you’ve customized the appropriate rules for persisting duplicate parties.

1. Ensure system is appropriately configured to persist duplicates as part of SDP Configuration

For example, the configuration should be (at minimum):

Feature | CONFIGELEMENT.NAME | CONFIGELEMENT.VALUE
SDP configuration – run for add/update services | /IBM/Party/SuspectProcessing/enabled | true
SDP configuration – persist duplicates (based on particular criteria; default – if parties are within the same Line of Business) | /IBM/Party/PersistDuplicateParties/enabled | true

2. Customize appropriately the suspect pool filtering rule (BestFilteredSuspectsRule).

3. Customize, if necessary, the suspect type adjustment rule (AdjustSuspectStatusRule)

Tip:

The source code to the external rules is available with the product distribution.

To replace these rules you have a number of choices. You can create your own new rule class and configure it in the rules framework, or customize the code in the existing rule and leave the existing configuration for that rule in place. Typically implementation teams extend the provided rule class and override only the methods containing logic that requires changes. The configuration for that rule is then replaced in the External Rule Component.
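The "extend and override" pattern described above can be sketched as follows. The class and method names are hypothetical stand-ins rather than the product's rule classes, and the fields lob, region and score are invented for the example. The product's external rules are Java classes; Python is used here only to keep the illustration short.

```python
from typing import Dict, List


class ProvidedSuspectFilterRule:
    """Hypothetical stand-in for a rule class shipped with a product."""

    def execute(self, source: Dict, suspects: List[Dict]) -> List[Dict]:
        # Shared logic that the extension leaves untouched.
        kept = [s for s in suspects if self.keep(source, s)]
        return sorted(kept, key=lambda s: s["score"], reverse=True)

    def keep(self, source: Dict, suspect: Dict) -> bool:
        return suspect["lob"] == source["lob"]


class RegionAwareSuspectFilterRule(ProvidedSuspectFilterRule):
    """Overrides only the method whose logic needs to change."""

    def keep(self, source: Dict, suspect: Dict) -> bool:
        same_lob = super().keep(source, suspect)
        return same_lob and suspect["region"] == source["region"]
```

Because only keep is overridden, the extension continues to inherit the surrounding behavior of the provided class, which limits the code that must be revisited when the provided rule changes.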


Notices

This information was developed for products and services offered in Canada.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country/region or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

This document may provide links or references to non-IBM Web sites and resources. IBM makes no representations, warranties, or other commitments whatsoever about any non-IBM Web sites or third-party resources that may be referenced, accessible from, or linked from this document. A link to a non-IBM Web site does not mean that IBM endorses the content or use of such Web site or its owner. In addition, IBM is not a party to or responsible for any transactions you may enter into with third parties, even if you learn of such parties (or use a link to such parties) from an IBM site. Accordingly, you acknowledge and agree that IBM is not responsible for the availability of such external sites or resources, and is not responsible or liable for any content, services, products, or other materials on or available from those sites or resources. Any software provided by third parties is subject to the terms and conditions of the license that accompanies that software.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.


Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information that has been exchanged, should contact:

IBM Canada Limited
Office of the Lab Director
8200 Warden Avenue
Markham, Ontario L6G 1C7
CANADA

Such information may be available, subject to appropriate terms and conditions, including in some cases payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights reserved.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.


Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.
