Data Cleansing Guide

Virgil Technologies – Your Partner in Identity Management

Version 1.0 • Document Owner: Virgil Titeu • Last updated: 3/10/2005

www.v-tech.com.au



Document Control

Version   Description       Owner          Distribution
1.0       Initial Release   Virgil Titeu   v-tech

Virgil Technologies. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

49 Larkspur Circuit • Glen Waverley • Victoria • Australia
Phone (+61) 0 417 508 923 • Fax (+61) 0 3 9560 5986
http://www.v-tech.com.au • e-mail [email protected]


Table of Contents

1 Introduction
2 Definitions and Categories
   2.1 Entities in Identity Management
   2.2 Areas of Data Cleansing
   2.3 Data Loading and Production Cut-over
3 PERSON Entity Data Cleansing
   3.1 Off-line Data Joining
   3.2 Off-line Data Mapping
   3.3 Off-line Data Normalization
   3.4 Off-line Data Loading and Production Cutover
   3.5 In-line Data Joining
   3.6 In-line Data Mapping
   3.7 In-line Data Normalization
   3.8 In-line Data Loading and Production Cutover
4 Non-PERSON Entities Data Cleansing
   4.1 Off-line Data Joining
   4.2 Off-line Data Mapping
   4.3 Off-line Data Normalization
   4.4 Off-line Data Loading and Production Cutover
   4.5 In-line Data Joining
   4.6 In-line Data Mapping
   4.7 In-line Data Normalization
   4.8 In-line Data Loading and Production Cutover


1 Introduction

This manual is addressed to vendors, consultants and customers involved in the implementation and delivery of Identity Management projects.

Over several years of involvement in Identity Management projects, it became obvious that there is no consistent and measurable approach to Data Cleansing. Vendors, consulting firms and customers rely upon each other's best effort, without a methodology aimed at an expected, measurable result. This manual should be used as a guide that provides delivery metrics on which all the involved parties can agree regarding the quality of the data delivered. The result is never going to be perfect; however, the issues and the corresponding corrective actions should be mapped forward, so that the customer's security remains in a controlled-risk environment.

This manual is focussed on issues raised during Meta-Directory implementations. Data Cleansing in Virtual Directory implementations will be addressed in a separate guide.

Clarification on the information available in this guide can be provided as a consulting service. Please contact Virgil Technologies to arrange for a consulting session.

2 Definitions and Categories

It would be risky to embark on this exercise without first clarifying the terminology and the intent of the wording. The main areas of concern are the Entity types, the areas and activities of data cleansing, and the process that takes it to delivery, known as loading and production cut-over. It is left to the engagement governance and process to assign roles and responsibilities to the different parties involved.

2.1 Entities in Identity Management

Are there more types of entities in Identity Management?

Do we want to focus on all of them, and are there any risks in leaving anything out?

Just some of the questions not yet fully answered!

Yes, there are multiple types of entities in the Identity Management space. So far, Identity Management projects have mainly been concerned with, and attempted to address, PERSON entities. In today's complex IT environment there are also entities related to applications and equipment, as well as fictional ones that empower the functionality delivered (admin, training, test, etc.). However, it has to be recognised that by not addressing these other types of entities, significant security risks are left exposed.

A basic classification of Entities to be considered will be:

• PERSON – this type of entity is defined by the existence of a physical human persona


• APPLICATIONS – every application which requires to be run as a service and creates accounts in the underlying layer of software (Operating System)

• NETWORK – all the hardware equipment deployed in order to deliver the IT Operations capability (servers, routers, switches, SAN, etc.)

• FUNCTIONAL – all the generic entities created for the delivery of administrative functions (admin, test, training, templates, etc.)

• ORGANIZATIONS – all types of organizational entities, like suppliers and business partners (or their representatives) acting in an organizational capacity.

Each type of entity has specific defining characteristics and security requirements. It can be argued that this division of entities complicates the landscape; therefore we can agree to reduce it initially to PERSON and NON-PERSON types.

2.2 Areas of Data Cleansing

What are you trying to achieve?

Security, increased functionality or aesthetics?

In view of the effort and financial cost you are undertaking, all of the above are desirable as well as achievable. Yes, you aim to identify every entity across several repositories and link all that information to a single entry in your Meta-directory. You are also required to provide more information associated with that entry, which currently resides in disparate repositories and is difficult to assemble in real time (the situation differs for Virtual Directories). The new information should be available in formats which are standards or industry-accepted norms, so that it can be re-used at a later stage by different existing or new applications and technologies. The main activities identified will be:

DATA Joining – ensuring that each Meta-Directory entity is a correct reflection of the entity's data across the selected repositories, associated with that entity only. This area addresses your security requirements.

DATA Mapping – collating in the Meta-Directory the additional desired attributes currently available in disparate repositories. This area enhances the functionality available to your project.

DATA Normalization – bringing data storage and presentation to a consistent format, according to industry norms or standards for the selected attribute. This area delivers the user experience and application interoperability.

The main concern which Identity Management projects attempt to address is reliable, consistent and persistent DATA Joining; therefore most effort will be placed on this activity.

Data Mapping is relatively straightforward to achieve once the procedures for DATA Joining have been agreed.

DATA Normalization firstly requires the parties to adhere to a certain storage and presentation standard, and to weigh the cost/benefit ratio of delivering this within the first project stage. It has to be assessed whether it is worth engaging expensive project resources in addressing normalization, or rather establishing the framework and addressing it as an on-going concern. Any effort to address Data Normalization issues has to deal with data at three different layers:

Data STORAGE layer – at this layer it is advisable that data be present in the most widely accepted standard (local, international or corporate).

Data MANIPULATION layer – this is the layer where data gets shared between the different applications deployed within the environment.

Data PRESENTATION layer – at this layer, data should be presented with customer-sensitive locator features (eg. Australian phone numbers should be 8 digits when the customer comes from an IP within the same state – 9825 3316; carry the state prefix when the home state cannot be determined – 03 9825 3316; and include the country code when the visitor comes from a known international location – +61 3 9825 3316). A sketch of these presentation rules follows.
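To make the presentation rules concrete, here is a minimal Python sketch. The format_au_phone helper and its visitor categories are hypothetical illustrations, and it assumes numbers are stored canonically in full international form; a real deployment would derive the visitor's location from the request IP:

```python
import re

def format_au_phone(stored: str, visitor: str) -> str:
    """Render a canonically stored AU number for a given visitor location.

    visitor is one of "same-state", "interstate", "international".
    Hypothetical helper: real code would derive the location from the IP.
    """
    digits = re.sub(r"\D", "", stored)      # "+61 3 9825 3316" -> "61398253316"
    if not digits.startswith("61") or len(digits) != 11:
        raise ValueError("expected a canonical +61 landline number")
    area, local = digits[2], digits[3:]     # area code and 8-digit local number
    local_fmt = f"{local[:4]} {local[4:]}"
    if visitor == "same-state":
        return local_fmt                    # 9825 3316
    if visitor == "interstate":
        return f"0{area} {local_fmt}"       # 03 9825 3316
    return f"+61 {area} {local_fmt}"        # +61 3 9825 3316

print(format_au_phone("+61 3 9825 3316", "interstate"))
```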

While all of these activities can largely be resolved in a programmatic way, all parties will have to accept that some manual processing is also involved. However, this guide tries to provide a framework in which the scope and extent of manual processing can be qualified, quantified and agreed prior to any undertakings.

Therefore data cleansing, in its totality, involves all the processes, technology and effort sustained towards achieving the desired levels of data reliability, consistency, persistency, extensibility and normalization, as well as preserving that persistency once the data is loaded in the Meta-directory.

2.3 Data Loading and Production Cut-over

You spent money and time; how are you going to maintain the level of quality achieved?

What policies and rules can you pass on to an administrator so that there is no requirement to re-iterate the effort?

After all the effort and engagement with different parts of the organization, it comes down to delivering on the promise to your executive board. There are a number of different approaches employed by vendors in this space in order to deliver on that promise. It is hard to qualify which one is better than the other, and they depend on the approach taken in data joining. All parties will have to bear in mind the positions of the other parties involved in the project. Usually the vendor is concerned with delivering within the requirements and the costs, the consultancy wants to provide suitable advice for the case, and the customer will aim for something as close as possible to perfection. Mainly, the customer will have to accept that quality comes down to the cost/benefit ratio, and that this space will require on-going work and, consequently, resources to be allocated.

The governance in this space belongs to the consultancy, in order to align the AS-IS business processes to the TO-BE design, as well as to the vendor's technology employed. The consultancy should also provide an indication of the level of effort and customisation required, and secure understanding and agreement regarding the roles and responsibilities during the Data Cleansing process.

Based on the data joining scenarios (yet to be discussed) there are a number of different approaches:


• Off-line data joining and marking, followed by an initial load, with business processes being accounted for in the proposed design functionality. This means that the data joining is performed outside of the Meta-directory or any underlying repositories, based on the business rules discovered and maintained in the proposed design. Once the data is loaded in the Meta-directory and the connectors are switched on, the connector specification will employ rules designed to maintain and improve the level of consistency and persistency desired.

• In-line data joining – the repository with the largest number of qualified entries is loaded first into the Meta-directory. Ideally, in order to deliver the requirements of consistency and persistency, this should be the only repository allowed to create entries in the Meta-directory. Subsequent repositories are compared against this data sample, and the connector specifications employ the business rules for joining/matching and feed exceptions back to the respective repository administrators.

From an effort perspective, off-line data joining is the most economical, since it can be achieved using conventional relational database concepts and tools. The aim is to drive the initial load towards compliance with the TO-BE business rules, since the connectors will be developed to cater for the new operational model.

The in-line model requires significant business process discovery, modelling and coding to accommodate both past and future situations. It is probably best suited to accommodating the inconsistencies of human behaviour and the scaled change management process associated with such an exercise. Unless the vendor is willing to undertake such an effort, it will be difficult for any customer to secure such an approach. It provides benefits where significant secondary use of the data is envisaged.

Regardless of the chosen approach, all parties should aim for minimum interruption to the customer's business. Therefore the production cutover freeze should be minimized, from an overnight window to at most a weekend rollout. It is envisioned that the production cutover is properly preceded by extensive change management and training engagements.

The expectation is that once the deployment is completed the environment will “tick along” and greater quality and benefits will be achieved as time progresses.

3 PERSON Entity Data Cleansing

Who's who in your organization?

What are their attributes?

How would you like to store and present this information?

As specified above, the aim of data cleansing in an Identity Management project is to achieve reliable, consistent and persistent DATA Joining, Mapping and Normalization. Let's have a look at a basic model of a Meta-directory implementation:


Figure 1. Meta-directory components and Architecture

It is strongly recommended to use no more than three underlying repositories, since any additional repository will increase the complexity of the initial model exponentially (to be detailed later).

Since at this stage we focus exclusively on PERSON entities, it makes sense to prepare in advance for what's ahead by asking your technical administrators to identify PERSON and NON-PERSON entities in their respective repositories. In most organizations there is at least an empirical approach to dealing with environment classification and change.

It is recommended that known NON-PERSON entities be marked, such that they can be differentiated from PERSON entities. This will not be exhaustive, but it will provide a good basis for further data cleansing activities. Most applications today have an LDAP or LDAP-resembling data schema, so most technical people will already be familiar with the attributes and will be quick to identify them. For marking purposes, it is suggested to use an attribute which does not have any functional/service-delivery implications, unless you are willing to go to the extent of extending the schema of the respective application. Let's say, for example's sake, that in the Network Operating System (NOS) we are going to use "Car Licence", and all the NON-PERSON entities will have this attribute set to one ("1").
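As an illustration of such marking, here is a minimal sketch using the third-party ldap3 library. The host, bind credentials, base DN and the list of known NON-PERSON accounts are hypothetical; carLicense is the standard inetOrgPerson attribute corresponding to "Car Licence":

```python
from ldap3 import Server, Connection, MODIFY_REPLACE

# Hypothetical host, bind credentials and base DN.
conn = Connection(Server("ldap.example.com"),
                  "cn=admin,dc=example,dc=com", "secret", auto_bind=True)

# Accounts the technical administrators have identified as NON-PERSON.
non_person_cns = ["admin", "backupsvc", "training01", "testuser"]

for cn in non_person_cns:
    if conn.search("ou=accounts,dc=example,dc=com", f"(cn={cn})"):
        for entry in conn.entries:
            # Set the marker so later extracts can filter NON-PERSON entities out.
            conn.modify(entry.entry_dn, {"carLicense": [(MODIFY_REPLACE, ["1"])]})

conn.unbind()
```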

A similar approach has to be employed for each and every underlying repository which stores both PERSON and NON-PERSON entities; otherwise no data extract will guarantee the separation between them.

3.1 Off-line Data Joining

In order to conduct the joining, the choice of database lies with the customer, based on factors like licensing and the in-house expertise available. It has been done with MS-Access, MS-SQL Server, Oracle, MS-Jet Engine, etc.

Thus we reach the point where there are three underlying repositories with only PERSON entities to be presented for Data Joining, Mapping and Normalization – cleansing purposes. This leads us to the following possible situations:


#   Source 1   Source 2   Source 3   Exception Handling
1   Join       Join       Join       Ideal – Authoritative Join
2   Join       Join       No-Join    Exception – Admin of Source 3 to check compliance with the business rule?
3   Join       No-Join    Join       Exception – Admin of Source 2 to check compliance with the business rule?
4   Join       No-Join    No-Join    Exception – Admins of Sources 2 & 3 to check compliance with the business rule?
5   No-Join    Join       Join       Exception – Admin of Source 1 to check compliance with the business rule?
6   No-Join    Join       No-Join    Exception – Admins of Sources 1 & 3 to check compliance with the business rule?
7   No-Join    No-Join    Join       Exception – Admins of Sources 1 & 2 to check compliance with the business rule?
8   No-Join    No-Join    No-Join    No concern

Table 2. Joining case – 3 sources, one attribute each

All the EXCEPTION situations will have to be processed by the business and technical administrators of the repository marked No-Join, who will provide a YES/NO statement that is then enforced with additional marking in their repository, in order to drive joining or exclusion. The marking has to be reflected within the joining repository as well, in order to ensure consistency and persistency.

As we can see, there are eight (2³) possible situations coming out of joining three underlying repositories. Clearly we would not focus on the last, eighth combination, since that is of no concern in our exercise. Unfortunately, this model implicitly assumes that we have an "Authoritative" attribute to perform the joining on. Also, the business, in conjunction with the application administrators and the vendor/consultant, will have to identify whether any of situations 2-7 are valid, in order to validate the creation of an ENTITY entry in the Meta-directory. Upon confirmation of business rule adherence, the entry is valid and a Meta-directory entry will be created; if no Meta-directory entry is to be created, additional marking has to be deployed in the underlying repositories. This marking will have to be accounted for within the deployed connector specification and development. A sketch of this classification follows.
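As an illustration, here is a minimal off-line classification sketch using Python's built-in sqlite3 in place of MS-Access or MS-SQL; the table and column names (nos, email, hr, logon_id) and the sample rows are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE nos   (logon_id TEXT);   -- Source 1: Network Operating System
CREATE TABLE email (logon_id TEXT);   -- Source 2: e-mail system
CREATE TABLE hr    (logon_id TEXT);   -- Source 3: HR system
INSERT INTO nos   VALUES ('jsmith'), ('adoe'), ('admin');
INSERT INTO email VALUES ('jsmith'), ('adoe'), ('webonly');
INSERT INTO hr    VALUES ('jsmith'), ('newhire');
""")

# Classify every candidate joining value by which sources it appears in.
rows = db.execute("""
SELECT v.logon_id,
       EXISTS(SELECT 1 FROM nos   WHERE logon_id = v.logon_id) AS in_nos,
       EXISTS(SELECT 1 FROM email WHERE logon_id = v.logon_id) AS in_email,
       EXISTS(SELECT 1 FROM hr    WHERE logon_id = v.logon_id) AS in_hr
FROM (SELECT logon_id FROM nos UNION
      SELECT logon_id FROM email UNION
      SELECT logon_id FROM hr) AS v
""").fetchall()

for logon, in_nos, in_email, in_hr in rows:
    if in_nos and in_email and in_hr:
        print(logon, "-> Authoritative join")
    else:
        missing = [name for name, hit in
                   [("Source 1", in_nos), ("Source 2", in_email), ("Source 3", in_hr)]
                   if not hit]
        print(logon, "-> Exception: admins of", ", ".join(missing), "to confirm")
```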

Over several iterations, with the aid of marking, we should reach a situation where the table becomes relevant only to the AUTHORITATIVE join. The iterative exercise which requires the Identity Management role, together with the business and application administrators, to introduce marking, attempt joins and handle NO-JOIN exceptions is known as DATA CLEANSING. Ideally, the joining attribute should be a STRONG attribute, such that the functionality of the application which generates and maintains it critically depends on its format. Therefore, the best candidates are the NOS logon, the email address, etc. It is not recommended to consider First or Last Names in this category, since these are typed freely across repositories, and First Names are often replaced by Preferred Names in non-HR repositories. Another ideal situation is when the organization already has an automated provisioning engine, where the joining attribute is created in this engine and passed over to all underlying repositories.

This is based on the assumption that the joining attribute is stored consistently across the three underlying repositories. This should become apparent during the business process discovery exercise conducted with both the business owners and the technical administrators of their respective applications. During this process the engagement has to be two-fold: business process transformation as well as application/data administration. It is possible that during this period, and before the data cleansing stage, a request will be put to the customer to carry a joining attribute across the underlying repositories. In order for that to be achievable, the engagement model has to be continuous but not burdensome; otherwise the stakeholders will likely become disillusioned and disengaged.

There are situations where the customer's expectation is that exactly this gap will be filled by the project; therefore a second-best approach has to be employed. This will create a "soft join", and significant work has to be undertaken to confirm its credibility. There are two possible approaches:

• Reduce the number of joining repositories

• Increase the number of joining attributes

I would recommend that the chosen approach follow the above sequence, firstly reducing the number of joining repositories. This can generate only two alternatives:

• We identify a STRONG attribute for an AUTHORITATIVE join, or

• We have to increase the number of joining attributes.

Firstly, we will focus on the worst possible case: no STRONG attribute that generates an AUTHORITATIVE join could be identified. This would be an extreme situation, since at this stage the most likely repositories are the Network Operating System and the email system, and in many IT environments there is a STRONG link between these two applications. However, for the sake of the example, we will present the case with an extended number of joining attributes. Data from the two repositories will have to be tested for consistency and, assuming that there are no AUTHORITATIVE joining attributes, a minimum of two "soft join" attributes will have to be identified (eg. Surname & location, Surname & team).

Another factor to consider is the number of entries in each repository, which could be:

• Similar – the business rule will be that everybody who has NOS access will have e-mail.

• The number of NOS entries is significantly larger than the number of email entries (>10%) – the business rule suggests that there will be NOS entries without email (some users will only access group email, but will log on individually).

• The number of email entries is significantly larger than the number of NOS entries (>10%) – there are users who access email via Webmail or similar platforms, which might not require a NOS entry.

The above cases will lead to the following possibilities:


#    Source 1.1   Source 1.2   Source 2.1   Source 2.2   Exception Handling
1    Yes          Yes          Yes          Yes          Join
2    Yes          Yes          Yes          No           Exception Handling – source 2 administrator to confirm
3    Yes          Yes          No           Yes          Exception Handling – source 2 administrator to confirm
4    Yes          No           Yes          Yes          Exception Handling – source 1 administrator to confirm
5    Yes          No           Yes          No           Exception Handling – source 1 & 2 administrators to confirm
6    Yes          No           No           Yes          No match – possible different entity, allow both entries to be created?
7    No           Yes          Yes          Yes          Exception Handling – source 1 administrator to confirm
8    No           Yes          Yes          No           No match – possible different entity, allow both entries to be created?
9    No           Yes          No           Yes          Exception Handling – source 1 & 2 administrators to confirm
10   No           No           Yes          Yes          Entry only in Source 2, create entry?
11   No           No           Yes          No           Entry only in Source 2, create entry?
12   No           No           No           Yes          Entry only in Source 2, create entry?
13   Yes          Yes          No           No           Entry only in Source 1, create entry?
14   Yes          No           No           No           Entry only in Source 1, create entry?
15   No           Yes          No           No           Entry only in Source 1, create entry?
16   No           No           No           No           No concern for our purposes

Legend: Source 1.1 = Source 1, Attribute 1; Source 1.2 = Source 1, Attribute 2; Source 2.1 = Source 2, Attribute 1; Source 2.2 = Source 2, Attribute 2.

Table 1. Joining case – 2 sources, 2 attributes each

Assuming that the different repositories have different numbers of entries, we will therefore have to enable Meta-directory entry creation based on administrator advice.

There will be 14 different situations which could possibly lead to the creation of new entries. During this process, the business rule which authorizes the entry creation will have to be enacted in the system by way of an associated marking. The marking has to be consistent and persistent; therefore, for each business rule, an additional attribute has to be identified for this purpose, both in the connected systems and in the joining repository. Implementing the additional marking will, over several iterations, lead to STRONG, AUTHORITATIVE joins across all repositories. A condensed sketch of this classification follows.
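Here is a minimal sketch condensing the sixteen cases of Table 1 into the four broad outcomes. The attribute names (surname, location) follow the soft-join example above, and the match rule (an attribute counts only when populated in both sources with equal values) is an assumption:

```python
def classify(rec1, rec2):
    """Classify a candidate pair of records from Source 1 and Source 2 (or None)."""
    if rec1 is None and rec2 is None:
        return "No concern"
    if rec1 is None:
        return "Entry only in Source 2 - create entry?"
    if rec2 is None:
        return "Entry only in Source 1 - create entry?"

    # An attribute counts only when populated in both sources with equal values.
    def match(attr):
        return bool(rec1.get(attr)) and rec1.get(attr) == rec2.get(attr)

    m1, m2 = match("surname"), match("location")
    if m1 and m2:
        return "Join"
    if m1 or m2:
        return "Exception - source administrators to confirm"
    return "No match - possible different entity, allow both entries to be created?"

print(classify({"surname": "Smith", "location": "VIC"},
               {"surname": "Smith", "location": "NSW"}))   # one attribute matches
```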

The iterative exercise which requires the Identity Management role, together with the business and application administrators, to introduce marking, attempt joins and handle NO-JOIN exceptions is known as DATA CLEANSING.

As seen so far, in order to achieve the best level of data joining, the fewer repositories you attempt to join initially, the better. Once you have achieved a stable, reliable and persistent reference, it is significantly easier to add additional repositories. Also, ideally, start with repositories where you have a STRONG attribute available. If one is not available, start with the repositories with the largest number of entries (they will help in establishing a larger reference base) and the most business-critical functionality (this will provide the best return on your invested effort).

3.2 Off-line Data Mapping

Mapping represents the addition of attributes within the joining engine, in view of the final upload to the production environment of all the attributes relevant to future usage of the established entity. The following are some of the criteria for deciding which attributes should be mapped and made available in the Meta-directory:

• All attributes used for joining purposes.

• All attributes used for marking purposes.

• All attributes which are authoritative in their source repositories.

• Additional attributes which enhance the identity description.

• Additional attributes which owners of existing or planned applications will use, and which will consequently be imported into those applications.

• Additional attributes which will enhance the Meta-directory's available functionality (Authentication, Authorization, Privacy, etc.).

• All attributes which will support delivery of additional functionality (workflow, delegation, etc.).

It is critical, as part of the mapping exercise, to identify the correct Meta-directory attributes where the mapped attribute values will be placed. This requires knowledge of the LDAP (or similar) schema of the existing and future repositories to be connected to the Meta-directory. A sketch of such a mapping table follows.
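Here is a minimal sketch of such a mapping table; the source column names on the left are hypothetical, while the targets are standard inetOrgPerson attributes:

```python
# Source attribute -> Meta-directory (inetOrgPerson) attribute.
ATTRIBUTE_MAP = {
    "first_name": "givenName",
    "last_name":  "sn",
    "email_addr": "mail",
    "phone":      "telephoneNumber",
    "nos_logon":  "uid",
}

def map_record(source_row: dict) -> dict:
    """Translate one joined source row into Meta-directory attribute names."""
    return {ldap_attr: source_row[src]
            for src, ldap_attr in ATTRIBUTE_MAP.items()
            if source_row.get(src)}

print(map_record({"first_name": "Jane", "last_name": "Smith",
                  "nos_logon": "jsmith"}))
```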

3.3 Off-line Data Normalization

There are numerous tools available on the market that provide support for Data Normalization, most of them with a history in Enterprise Application Integration. To the same extent, significant Data Normalization can be achieved programmatically with the tools provided by the Identity Management vendor.

There are several categories of Data Normalization measurement which you might desire to improve on:

• Data consistency

• Data completeness

• Data correctness

Before embarking on a significant effort towards improving your data formats, you should perform a cost/benefit analysis in order to establish whether such an effort is worthwhile as part of the core project activities, where expensive resources are employed, or whether you would rather achieve this as an on-going exercise. As an on-going activity, this can be achieved via the White Pages, where employees – the owners of the most significant and up-to-date knowledge of their current values – will be empowered to update them.

In order to assess the appropriate approach to the Data Normalization activity, you will be required to provide clarity on the following:

• Define the existing data storage format (eg. Mobile=0417508923).

• Define the future data storage format (eg. Mobile=+61 (0) 417 508 923).

• Identify the amount of effort required to achieve the desired outcome (a sketch of such a conversion follows this list).
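Here is a minimal sketch converting the existing storage format above into the future one; the regex assumes Australian 10-digit mobiles, and the target layout follows the document's own example:

```python
import re

def normalize_mobile(raw: str) -> str:
    """0417508923 (or +61417508923) -> +61 (0) 417 508 923."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("61"):
        digits = "0" + digits[2:]            # back to the national 04... form
    if not re.fullmatch(r"04\d{8}", digits):
        raise ValueError(f"not an Australian mobile: {raw!r}")
    return f"+61 (0) {digits[1:4]} {digits[4:7]} {digits[7:]}"

print(normalize_mobile("0417508923"))        # +61 (0) 417 508 923
```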

In order to achieve a meaningful outcome, some clarity has to be brought forward regarding the types of attributes available:


• Attributes available for modification in the White Pages (the user is the rightful owner, and changes do not require document sighting by another business authority).

• Attributes (hidden) which are not available for modification in the White Pages (the user is not the attribute owner – the values have been assigned to the user – or the modification requires appropriate documentation to be sighted, eg. name changes).

Once the attributes have been identified, the rules and procedures for each of the above situations will have to be defined, and their availability in existing or future products and interfaces identified.

For attributes available for user modification, the following validation mechanisms are required (a sketch follows the list):

• Develop the data format validation interface (Web, e-forms, etc.).

• Develop the data type validation – field length, string, numeric, binary, etc.

• Develop the existence validation – real data is entered (phone numbers, zip codes).

• Where possible values are known, provide drop-down options with pre-populated valid choices.
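Here is a minimal sketch of the format and type/length checks; the per-attribute rules and attribute names are hypothetical:

```python
import re

# Hypothetical per-attribute rules: data type/length and data format.
RULES = {
    "telephoneNumber": {"max_len": 20, "pattern": r"\+?[\d\s()]+"},
    "postalCode":      {"max_len": 4,  "pattern": r"\d{4}"},
}

def validate(attr: str, value: str) -> list:
    """Return a list of validation errors; empty means the value is accepted."""
    rule = RULES.get(attr)
    if rule is None:
        return [f"no validation rule defined for {attr}"]
    errors = []
    if len(value) > rule["max_len"]:
        errors.append(f"{attr}: longer than {rule['max_len']} characters")
    if not re.fullmatch(rule["pattern"], value):
        errors.append(f"{attr}: does not match the required format")
    return errors

print(validate("postalCode", "3150"))   # [] - accepted
print(validate("postalCode", "31AB"))   # format error reported
```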

For attributes which are not available for user modification, the following activities will have to be performed:

• Identify the business requirements for data normalization and the most appropriate source.

• Define, quantify and qualify the extent of the programmatic effort required to achieve the desired level of data normalization.

• Assign roles and responsibilities for manual correction, where required.

3.4 Off-line Data Loading and Production Cutover

The aim of this exercise is to load the data into the Meta-directory and switch on the connectors, so that the Identity Management software and the vendor deliver on their promise.

Before getting to this point, you have to be confident of the data quality available in your off-line data-cleansing tool (MS-SQL, MS-Access, etc.), since during the exercise entities might have joined and left the organization. The nature of this field is that it is in constant movement, and it can seem like a continuous catch-up at times. How can you be confident that the data you have available is ready to be loaded? There are several criteria which allow you to qualify the data quality:

• Joining-tool entries are joined on one or several STRONG attributes.

• For the entities/situations where STRONG attributes were not available, the administrators have identified the entity (based on social engineering) and the additional marking creates the STRONG join.

• There is a clear map of the required marking attributes, both in the Meta-directory and in the relevant repositories.

• This marking is clearly reflected in the connector specification, in order to ensure persistency once the data is loaded.

Even though it is labour intensive, customers may decide to populate repositories without a STRONG attribute with the most likely STRONG attribute from other repositories (usually the NOS logon). This may then become the STRONG joining attribute.

Where the above is not feasible or desirable (security concerns, external service providers, licensing implications), a unique random/meaningless ID (number) can be generated and populated in all the joined repositories, therefore becoming an AUTHORITATIVE/STRONG joining attribute. Note: check compliance with Privacy and other legislative requirements.

It is important to note that in situations where an AUTHORITATIVE/STRONG joining attribute is not available, both the connectors and the Meta-directory will have to account for the introduced marking, which acts as an additional joining attribute, therefore enhancing the joining authority.

When all this work has been performed and verified, the file can be exported in either CSV or LDIF format; a sketch of an LDIF export follows. Any vendor of Directory/Meta-Directory software will be able to provide tools to import the file into the Directory.
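Here is a minimal sketch of such an LDIF export; the base DN, attribute set and sample record are hypothetical, while the output itself is plain LDIF that directory import tooling accepts:

```python
BASE_DN = "ou=people,dc=example,dc=com"   # hypothetical directory location

def to_ldif(rec: dict) -> str:
    lines = [
        f"dn: cn={rec['cn']},{BASE_DN}",
        "objectClass: inetOrgPerson",
        f"cn: {rec['cn']}",
        f"sn: {rec['sn']}",
    ]
    for attr in ("givenName", "mail", "uid"):
        if rec.get(attr):
            lines.append(f"{attr}: {rec[attr]}")
    return "\n".join(lines) + "\n"

joined_records = [{"cn": "Jane Smith", "sn": "Smith", "uid": "jsmith"}]
with open("initial_load.ldif", "w") as f:
    for rec in joined_records:
        f.write(to_ldif(rec) + "\n")      # blank line separates LDIF entries
```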

Since Directories are intrinsically built to ensure entity uniqueness, special consideration should be paid to the definition of the Common Name (cn) and Distinguished Name (DN), in order to ensure and preserve this uniqueness. There are several approaches on the market, the most common one being to use the NOS logon. However, I argue that this has significant limitations in environments where your directory will have a large number of entries not using IT resources, or using them in shared/group mode. It may also have security implications, by exposing the NOS logon to Anonymous LDAP bind, as well as not being very user friendly. Historically, the NOS logon creation rules might not account very well for similar names and may run out of options; therefore I recommend something rather more user friendly.

During the DATA loading, entries are created in the Directory. It is important to verify that the same number of entries existing in your Data Joining tool gets created in the Directory. Any difference indicates that some entries are identical, and one has consequently been overwritten during the initial load. The Directory logs will help you identify the duplicates.

Once the Directory has been loaded and the connectors switched on, it is recommended that any PERSON entity entry only be created via the Provisioning Engine; any entity created through application administration should be defeated and flagged for the attention of the Identity Manager, in order to ensure compliance with the newly introduced business rules.

Since Meta-Directories expand to more applications within an organization in relation to entity information, it is expected that after the initial deployment more and more repositories will be integrated. Every time a new repository is added, a new OFF-LINE Data Cleansing exercise has to be conducted between the Directory data and the respective repository's information. This is required in order to identify the STRONG joining attribute, develop the required markings and analyse the business process integration.

NON-PERSON entries should be marked accordingly, so that they are not picked up by the connector technology; their situation is treated in the NON-PERSON entity Data Cleansing chapter.

3.5 In-line Data Joining

In order to conduct the joining in this case, the directory itself becomes the joining engine. This has several advantages over the previous approach:

• The directory technology is in a better position to ensure entity uniqueness, since this is an intrinsic feature of its design.

• The current connector technology has built-in, configurable support for data mapping, data normalization, data transformation, placement policies, event transformation and attribute filtering.

• The connector technology has built-in SMTP support; therefore all exceptions and errors can be automatically e-mailed to the designated attribute/issue owner for investigation and resolution.

• The whole development life-cycle can be approached consistently, since one and the same team will manage all the events. Therefore a higher degree of data quality can be achieved.

• There is a greater probability of establishing a reference platform of clean, known entities, since the join can be performed progressively, repository after repository. A gradual approach therefore supports this deployment. Even though it may seem time-intensive, due to the richness of features the same result can be achieved in a shorter timeframe.

However, as expected, there are some disadvantages with this approach:

• The major issue is that it requires cohesive coordination of the technical and business-process skills during connector implementation.

• It requires significantly better product knowledge, since additional customisation and development will be involved.

• It requires schema definition prior to the joining activity, since data will be consolidated inside the directory.

• It requires a decision on the Common Name (cn) and Distinguished Name (DN) definition prior to data joining.

• It requires that business rules and marking be known, in order to improve the joining probability.

As much as there is a requirement to have specific definitions of the directory environment, it is to be noted that, for experienced Directory/Meta-directory consultants, this should not pose a problem, since directory data conversion is straightforward.


Before commencing the DATA Joining, it is essential to perform a DATA analysis in the spirit of the Directory environment. Therefore dumps in LDIF or CSV format should be available from the repositories envisioned to be joined. The following should be ascertained:

• The repository with the largest number of PERSON entities.

• The repository with the authority to create the Directory entries.

• The existence of a STRONG joining attribute available across most repositories.

• If there is no STRONG joining attribute, what are the best (most used, most consistently represented) attributes across most repositories?

• Is it acceptable to pass back a Unique Identifier to the connected repositories? For customers with good HR practices, the "employeeNumber" may be available, and it also ensures ENTITY uniqueness. However, HR has its own business rules, which will not always accommodate the business rules of other groups.

The best case is achieved when the repository with the largest number of ENTITIES is chosen as the one with the AUTHORITY to create the Directory entry. The Directory/Meta-Directory technology allows different repositories to create directory entries. However, this approach is not recommended, since it increases the difficulty of managing the join and of deciding between valid and invalid Directory ENTITY creation.

Considering that the repository with the largest number of entries is the same as the one chosen to create the directory entries, the first step is to decide on the required Directory attributes to be imported from this repository.

The next step is to decide which directory attributes the repository attributes will be mapped to, where the values will be placed, and the amount of DATA Normalization to be performed.

Once these have been decided, the repository data dump (LDIF or CSV) should be loaded into the directory. This manual load does not preclude configuring a connector to the target repository and loading the data in-line. It is up to the consulting partner to alleviate possible customer concerns and to develop and implement the right safeguards (development environments, separation from production data, etc.).

This load will create the Directory entries. It should be checked that the number of created entries is the same as the number of source repository entries. Any mismatch should be analysed and addressed. This initial load will also highlight possible DATA Normalization issues: source repository attribute values that are not consistent in their type (numeric, Boolean, string), multi-value attributes, and non-supported characters (in DAP/X.500-based directories, "/" has to be escaped in attribute values).

After the initial load, therefore, the directory logs should be parsed for any possible error or exception from the ideal situation. There is considerable benefit in the amount of information that one basic activity, like the Directory load and entry creation, provides for all the areas of DATA Cleansing. Once the errors and exceptions have been identified, the consulting partner, in conjunction with the customer, should assign roles and responsibilities for removing the causes of the exceptions. By re-running the scenario (from an empty Directory), the errors and exceptions can be e-mailed to the assigned roles, thanks to the built-in connector support.

The iterative process should be repeated until no more errors or exceptions occur.

This means that the data is clean with regard to entity Directory entry creation and, to some extent, to data mapping. Using LDAP queries, a list of all the attributes deemed important which are not populated for a given entry can be extracted, and the repository owner can subsequently be requested to enter the relevant values; a sketch of such a query follows.
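Here is a minimal sketch using the third-party ldap3 library to list entries missing an important attribute (here: mail); the host, credentials and base DN are hypothetical:

```python
from ldap3 import Server, Connection

# Hypothetical host, bind credentials and base DN.
conn = Connection(Server("ldap.example.com"),
                  "cn=admin,dc=example,dc=com", "secret", auto_bind=True)

# "(!(mail=*))" selects person entries where the mail attribute is absent.
conn.search("ou=people,dc=example,dc=com",
            "(&(objectClass=inetOrgPerson)(!(mail=*)))",
            attributes=["cn"])
for entry in conn.entries:
    print("missing mail:", entry.entry_dn)

conn.unbind()
```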

The following step involves identifying the repository which shares a STRONG joining attribute with the original one. If this is the case, you only have to deal with two exceptions; otherwise the situation depicted in Table 1 becomes the reference.

If a STRONG joining attribute is available, the possible situation is that either the original repository (now the Directory reference) or the new repository does not have the joining attribute populated, or the entry does not exist in one of them. By employing social engineering techniques, this situation should be resolved, and adequate marking introduced to represent the corresponding exception to the general business rule.

The connector technology should be employed to highlight the reason for each exception and e-mail it to the respective repository owner. The iterative process should be repeated until no more errors or exceptions are e-mailed. These are removed either by assigning the right entity information to the corresponding entry, by enhancing the matching mechanism with additional marking, or by pushing back into both repositories a common STRONG joining attribute. There is no requirement to start fresh, since directories can be configured to replace attribute values with new ones, and the schema can be extended dynamically.

It is extremely critical that the second repository not be allowed to create new entries on an on-going basis; on-going entries will be re-provisioned via the provisioning engine. This is critical in order to achieve consistent and persistent data quality. Additional attributes can be mapped to the entity entry via this process, and a certain degree of DATA Normalization can also be delivered via the connector transformation rules.

At this stage you have achieved a clean, consistent Directory Information Tree. This allows you to repeat the process with a new repository until the joining reaches the agreed standard of cleanliness, and then move on to another repository.

The benefit of this method is that additional connectors can easily be deployed and integrated within the existing infrastructure, since the existing data is the reference for new developments. It also establishes a pattern of expectation within the customer environment, making it considerably less painful to extend the integration to a large number of repositories. This enables organisations to prioritise their deliverables: increased perimeter security, IT security, extended functionality, asset management, or cost recovery and cross-billing purposes (based on a real-time user base).

It may seem that by employing this staggered approach it takes considerably more time to achieve similar results by comparison with the OFF-LINE Data Joining method. However, this is incremental work, so the knowledge transfer occurs significantly faster, since the same approach and methodology are employed in any given situation. Therefore, after the initial set-up, load and join, the activities can increase significantly in speed and accuracy, due to the customer's active participation.

Of significant importance is that once a baseline of Meta-Directory data has been achieved, it can be made available to multiple developers, which enables concurrent development of different connectors. The main qualifier is that none of the subsequent repositories to be joined are allowed to create new entries. If this requirement cannot be satisfied via social engineering, then the repository in question should be analysed and joined with the Meta-Directory, towards achieving a new baseline of joined repositories.

It is critical that the PERSON-only entries from each in-scope repository are analysed with regard to number of entries, joining attributes and authoritative attributes prior to commencing any DATA Cleansing activities. It is advisable that the repository with the largest number of entries be loaded first, in order to establish the baseline, with the following repositories loaded afterwards. The loading order is to be established as a function of the STRONG joining attribute and the AUTHORITATIVE attributes.

The level of data quality, consistency and persistency achieved via IN-LINE DATA Joining is superior to that delivered by OFF-LINE DATA Joining methods.

However, without a clear methodology for development, error and exception handling, or without understanding of and alignment to the business processes, the vendor or system integrator may not choose this path at any cost.

3.6 In-line Data Mapping

In-line Data Mapping is a by-product of IN-LINE DATA Joining. It is a normal connector activity to map additional repository attributes to the desired Meta-directory attributes. This is a straightforward connector configuration; generally it is achieved within the FILTER or mapping part of the connector, where a correspondence table can be built.

Of special consideration at this step is DATA TYPE mapping. This means that, without any DATA TYPE transformation, the repository attribute type should match the Meta-directory attribute type (eg. a repository STRING attribute should be mapped to a STRING attribute rather than a numeric one in the Meta-directory). However, modern Meta-directory technology allows for DATA TYPE translation; a sketch follows.
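Here is a minimal sketch of such a DATA TYPE translation; the target type names and the employeeNumber example are hypothetical:

```python
def translate(value, target_type: str):
    """Coerce a repository attribute value to the Meta-directory attribute type."""
    if target_type == "string":
        return str(value)
    if target_type == "integer":
        return int(value)
    if target_type == "boolean":
        return str(value).strip().lower() in ("1", "true", "yes")
    raise ValueError(f"unsupported target type: {target_type}")

# e.g. an integer employeeNumber destined for a STRING directory attribute:
print(translate(10452, "string"))        # "10452"
```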

An additional benefit of modern connector technology is the bi-directional functionality it delivers, which is easy to implement. In that sense, attributes not yet available in the underlying repository can be written back from those associated with the existing Meta-directory entries, thus increasing data consistency and cleanliness, and also enriching the functionality delivered by the underlying repository – if not to the user community, at least to the IT Operations group.

3.7 In-line Data Normalization

The scope and extent of Data Normalization has to be agreed with the customer before attempting to resolve it within the connector technology.

This requires defining the DATA Normalization requirements for all of the following:

• Data format storage

• Data format presentation

• Data format required for application interchange/interoperability

As identified above, for attributes not available for user editing, connector DATA Normalization will carry a higher weighting, as opposed to attributes identified as being under undisputed user ownership and control. For the latter, there may be a better way: providing choices via drop-down option lists in the White Pages, where the user community will correct their own data. Importantly, this approach will implicitly gather support from a privacy business perspective.

For those attributes outside of direct user ownership and control, a significant amount of DATA Normalization can be achieved within the connector technology.

This can take several approaches, depending on the attribute types and the data's relevance to the delivery of the required business function:

• Identify and define the acceptable data storage, presentation and application interchange requirements.

• Identify available standards for the attribute type in discussion and the requirement for compliance.

• Identify the availability of template data references (eg. for certain attributes only certain values are valid, and the customer has the list of available choices and is willing to enforce compliance).

• Identify the number of available options and the relationships with other attributes, in order to perform this exercise in a programmatic way.

The benefit of this exercise is that normalized data can subsequently be written back into the connected repositories, therefore improving the functionality and user experience of the target application. Normalised data can also be passed to repositories that do not have the respective attributes populated.

The best outcome is usually achieved when there is a mixed approach:

• Develop drop-down option lists for users to self-correct the attributes they own, via the White Pages self-service interface.

• Develop an intelligent DATA Entry interface for the Provisioning Engine, where data is validated for quality and validity.

• Develop a programmatic approach to data normalization within the connector transformation rules. It is to be noted that, even though this can be achieved, vendors are reluctant to engage in providing this functionality unless the requirements, scope and references are clearly defined. Therefore, in order to achieve maximum benefit, customers are advised to identify tight business rules which lend themselves to a programmatic approach.

• Leave data un-normalized for attributes of lesser relevance to application functionality (informative only), and try to identify another application where the respective attributes are AUTHORITATIVE, and thus achieve the desired level of NORMALIZATION.

3.8 In-line Data Loading and Production Cutover

Prior to the Meta-directory production deployment, the following activities have to reach completion: connectors to the different repositories are developed, the business rules are implemented, marking is available, and all the repository entries have been resolved and are joined in the Meta-Directory via a STRONG attribute or a conjunction of SOFT matching and STRONG marking.

Also, the other elements of the architecture, business process change and governance have been developed:

The PROVISIONING ENGINE has been developed, the new business rules have been modelled, and the data model has been loaded for quality and functionality checks. All the logical and functional relationships between attributes have been encoded for additional DATA ENTRY checks. Any PERSON entity provisioning function in the connected repositories will be defeated and reverted to the previous status; any future provisioning activity has been exclusively assigned to the PROVISIONING ENGINE. The PROVISIONING ENGINE has been developed with DELEGATED administration. Functionally, as the ENTITY creation propagates from the PROVISIONING Engine to the Meta-directory, you are able to apply the desired granularity for the services you want to provision. The Engine is EXTENSIBLE, and its GOVERNANCE model has been developed and agreed.

The White Pages have been developed, the PRIVACY requirements have been met, the self-service functionality is available, and the data entry has been enhanced with quality checks, both in regard to data types and to functionality. The organisation's own information has been populated, where available, in drop-down options for consistency, and roles and responsibilities have been assigned in regard to data updates and maintenance.

Change management and training are underway; the SERVICE DESK and the users of the PROVISIONING ENGINE and of the White Pages are trained. Delegated administration has been developed (user, service desk, meta-directory administrator).

Roles and responsibilities have been developed and agreed in regard to the administration of the Meta-Directory infrastructure. The governance model has been developed, agreed and implemented, and business and technical ownership of the delivery of functional and technical capability has been assigned. Governance for further development has likewise been developed, agreed and implemented.

The path forward has been mapped, with the applications, repositories and increased functionality to be developed and deployed next.

Should the Identity Vault also be the LDAP Authentication Directory (Y/N)? If it provides automatic failover, true high availability, LDAP replication and LDAP hierarchy integration (parts of the tree are controlled at their source), then YES. If high availability and LDAP replication are not available, it is advisable to separate the environments.
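The same decision, stated as a rule; the capability names below are descriptive assumptions rather than flags of any real product.

```python
# The decision above as a simple rule. Capability names are
# descriptive assumptions, not flags of any real product.

REQUIRED_FOR_AUTH_DIRECTORY = {
    "automatic_failover",
    "high_availability",
    "ldap_replication",
    "ldap_hierarchy_integration",  # parts of the tree controlled at source
}

def vault_can_be_auth_directory(capabilities: set) -> bool:
    """True if the Identity Vault can also serve as the LDAP
    Authentication Directory; otherwise separate the environments."""
    return REQUIRED_FOR_AUTH_DIRECTORY <= capabilities
```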

The environment has undergone, and satisfied, all the testing requirements.


Then you are ready to move into PRODUCTION. The main questions at this stage are:

How to load the data within the Meta-Directory?

How to cut over into production?

With IN-LINE DATA CLEANSING, there is no initial LOAD. The initial LOAD occurred a long time ago, when the architect and developer loaded your first repository into their development systems. Since then, there has been an incremental progression of the system towards a fully functional environment.

However, it is possible to start afresh in order to ensure that the JOINING and MARKING are delivering the desired quality, consistency and persistency. Organisations are living organisms in continuous change, and the designed and developed solution is required to accommodate this: data has changed, and people have left and joined the organisation over the life of the project. Therefore, fresh data sets from the underlying repositories can be loaded: first the first repository's data, ensuring there are no errors, then the second repository's data, ensuring there are no functionality errors, and so on (a sketch of this staged load follows). As advised earlier, the maximum number of repositories to be JOINED during the first phase is recommended to be three. Thus a clean, reliable, consistent and persistent JOINED data source can be established.
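A sketch of the staged reload, assuming hypothetical load_repository() and verify_join() hooks that stand in for the product-specific connector and reporting functions:

```python
# Sketch of the incremental reload: repositories are loaded one at a
# time and each load is verified before the next begins.
# load_repository() and verify_join() are hypothetical stand-ins for
# product-specific connector and reporting functions.

def staged_load(repositories, load_repository, verify_join, phase_one_limit=3):
    """Load repositories sequentially, stopping at the first failure."""
    if len(repositories) > phase_one_limit:
        raise ValueError(
            f"Phase one should JOIN at most {phase_one_limit} repositories"
        )
    for repo in repositories:
        load_repository(repo)       # incremental load of one source
        errors = verify_join(repo)  # check JOINING and MARKING results
        if errors:
            # Resolve before touching the next repository, preserving a
            # clean, consistent, persistent JOINED data source.
            raise RuntimeError(f"Join errors in {repo}: {errors}")
```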

The added benefit of this approach is that it minimizes the impact on the customer. The production cutover, where the new business processes are turned on, the new PROVISIONING ENGINE comes to life, and the White Pages start providing the means of modifying one's own user data, can be reduced to an overnight process, if not hours. Considering that this environment may have to be deployed in a global organisation, whose users are always on the job, the impact to the business should be minimized.

Since the current connector technology is event-driven, there is no need for long freezes of the PROVISIONING activities. The main qualifier is that the order in which the connectors are switched on should be the same as the order in which they were added to the Meta-Directory, thus also reflecting the dependencies and order of the provisioning/deprovisioning process.

It is advisable that an assessment be conducted by the vendor, customer and consultant within the first week. During the first month of PRODUCTION, the customer should closely monitor the environment, and all issues should be collated.

As expected, over time more Entity-dependent applications will be meshed into the Meta-Directory, and new joining attributes, processes, markings and business rules will be added to the Identity Management space. The advantage of the In-Line Joining Methodology is that it has minimal impact when developing the changes associated with integrating an additional repository. All the integration work can be carried out in the TEST or PRE-PRODUCTION environment, and only a reliable and persistent solution should be cut over into PRODUCTION.


4 Non-PERSON Entities Data Cleansing

Who else can attack your organisation? Can people hide behind applications, systems and generic accounts?

What are their attributes?

How would you like to store and present this information?

This is often ignored, but significant risks are posed to IT Operations and to the business itself if NON-PERSON ENTITIES are not treated in a similar way to PERSON ENTITIES. In general terms, these can be separated into DEVICES and SERVICES. The Networks and Operating Systems groups generally account for DEVICES in the course of IT Operations Management. SERVICES, however, are traditionally unaccounted for, and there is a proliferation of traditional and especially newly developed web services.

In the current environment of outsourcing and offshoring, it is not uncommon for IT technical administrators to feel intimidated, if not outright threatened. It should come as no surprise if they create accounts with significant authority, either mimicking PERSON entities or, even worse, NON-PERSON ENTITIES.

It is therefore imperative that NON-PERSON ENTITIES are treated in a similar way to PERSON ENTITIES. By performing the NON-PERSON ENTITIES joining, mapping, normalization and loading into the Directory, the customer is in a better position to manage its Authentication and Authorization processes. The same diagram depicted in Figure 1 applies. It is imperative to develop a PROVISIONING ENGINE which consolidates the CREATION, DEACTIVATION, ACTIVATION and DELETION of system, application, testing and service accounts (a lifecycle sketch follows Figure 1).

Figure 1. Internal Directory Tree: My ORGANIZATION, with PERSON and NON-PERSON branches (each divided into ACTIVE and INACTIVE sub-trees) and the ORG Chart.
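The consolidated account operations can be pictured as a small lifecycle; the state names and the transition table below are illustrative assumptions, not a product specification.

```python
# Sketch of a consolidated lifecycle for system, application, testing
# and service accounts handled by the PROVISIONING ENGINE. The states
# and transition table are illustrative assumptions.

ALLOWED_TRANSITIONS = {
    ("created", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "activate"): "active",
    ("inactive", "delete"): "deleted",
}

def apply_operation(state: str, operation: str) -> str:
    """Apply a provisioning operation, rejecting transitions that would
    bypass the lifecycle (e.g. deleting an account that is still active)."""
    try:
        return ALLOWED_TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"{operation!r} not allowed in state {state!r}") from None
```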

In the same way, the NON-PERSON ENTITIES White Pages can provide a consolidated view of all the relevant accounts: the applications they are available to, the user/owner of each account, updates for re-assigning an account to a different user, and so on.

For the purpose of consistency, either approach can be employed in providing the production data:


OFF-LINE Joining, Mapping, Normalization and Directory loading

IN-LINE Joining, Mapping, Normalization and Directory loading

However, it is advisable to employ the same approach as used for PERSON ENTITIES. In this case the processes of Joining, Mapping, Normalization and Directory loading can be performed simultaneously.

Since the NON-PERSON ENTITIES should be differentiated from the PERSON ENTITIES, it is advisable to store them in a separate Directory tree applicable to this type of ENTITY (a minimal sketch follows).
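A minimal sketch of the separate-tree layout, using the open-source ldap3 Python library; the host, credentials, DNs and object classes are all assumptions for illustration.

```python
# Minimal sketch of placing PERSON and NON-PERSON entities in separate
# Directory subtrees, using the ldap3 library. Host, credentials, DNs
# and object classes are assumptions for illustration.

from ldap3 import Server, Connection

server = Server("ldap://metadir.example.com")
conn = Connection(server, user="cn=admin,dc=example,dc=com",
                  password="secret", auto_bind=True)

# A PERSON entry is created under its own subtree ...
conn.add("cn=jsmith,ou=people,dc=example,dc=com",
         "inetOrgPerson", {"sn": "Smith", "givenName": "Jane"})

# ... while a service account lives in a separate NON-PERSON subtree.
conn.add("cn=backup-svc,ou=services,ou=non-person,dc=example,dc=com",
         "account", {"uid": "backup-svc"})
```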

In a similar way to how the PROVISIONING activities for the PERSON ENTITIES are assigned to the business, the NON-PERSON ENTITIES PROVISIONING activities could be assigned to the SERVICE DESK. This introduces a psychological barrier and a consistency of approach in applying the NON-PERSON ENTITIES business rules.

By embarking on NON-PERSON Entities Data Cleansing, organisations should be aware of the complexity of the task, due to the ENTITY classification and the requirement to decide on the governance, ownership and operational administration models to be developed.

Another major distinction is that NON-PERSON Entities can be:

Existent in multiple repositories (e.g. the same test accounts can be used across MS-Office and EDM). This is ideal when the scope is to control and consolidate the administration of those accounts. However, similarly named Entities can exist in multiple repositories while actually being different from one another (typically the "administrator" account, which differs between the OS, the Application and the Business Application). In this case, each will create a different directory entry, carrying enough information to discriminate between the similarly named entries (see the sketch after this list).

Existent in only one instance. Most of the hardware and equipment falls into this category (servers, switches, routers, hubs, etc.). Even though a customer will obviously have several similar pieces of technology, the role of each in the Operations Infrastructure is uniquely defined. Several services will nevertheless benefit from their Identity: IT Operations, lease and asset management, accounting. These can help in building the list of attributes required for these entities.
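The discrimination between similarly named entities can be as simple as qualifying each account with its home repository; the field names in this sketch are assumptions.

```python
# Sketch of discriminating similarly named NON-PERSON entities by
# qualifying them with their home repository. Field names are
# illustrative assumptions.

def entity_key(name: str, home_repository: str) -> str:
    """Build a unique key so that 'administrator' on MS-Windows and
    'administrator' on NetWare create distinct directory entries."""
    return f"{name}@{home_repository}"

accounts = [
    {"name": "administrator", "home_repository": "MS-Windows"},
    {"name": "administrator", "home_repository": "NetWare"},
    {"name": "testuser01", "home_repository": "MS-Office"},
    {"name": "testuser01", "home_repository": "EDM"},
]

# The two 'administrator' accounts stay separate; whether the two
# 'testuser01' entries should instead be JOINED as one shared test
# account is a business decision captured in the joining rules.
keys = {entity_key(a["name"], a["home_repository"]) for a in accounts}
assert len(keys) == 4
```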

It is recommended to undertake a progressive approach in extending Data Cleansing to NON-PERSON ENTITIES. In this regard, APPLICATION and FUNCTIONAL Entities will be present in the connected repositories, and it will be possible to perform their CLEANSING during the PERSON ENTITIES Data Cleansing. As you might remember, these have already been marked out in the first step of PERSON ENTITIES Data Cleansing.

At a later stage, as more experience is built, the NETWORK Entities will have to be accounted for. It is to be noted that these entities are critical pieces of technology in delivering a layer of defence protecting the organisation, as well as enabling secure internal transactions and, not so far into the future, secure federated transactions.

4.1 Off-line Data Joining

For consistency purposes, the same process and tool chosen for PERSON Entity Data Joining should be used during the NON-PERSON Entity Data Joining.


Whereas in the course of joining PERSON Entities there was an expectation that most of them would be available in multiple repositories, it should be no surprise that the NON-PERSON Entities do not yet span many repositories.

It is therefore expected that the exercise will mostly provide an inventory of NON-PERSON Entities. Also, accounts may have the same names across different applications. Therefore, if the credentials (Authentication and Authorization profiles) are different, it is imperative to further describe the account with system-relevant information.

4.2 Off-line Data Mapping

As mentioned above, additional mapping is required in order to localise the entity to its home environment and to place it in the right context. It is expected that additional descriptive and marking attributes will be introduced at this stage.

The categories of attributes used for PERSON Entity Data Mapping are also relevant for NON-PERSON Entities:

All attributes used for joining purposes.

All attributes used for marking purposes.

All attributes which are authoritative in their source repositories.

Additional attributes which enhance the identity description.

Additional attributes which owners of existing or planned applications will use, and which will consequently be imported into those applications.

Additional attributes which will enhance the Meta-directory available functionality (Authentication, Authorization, Privacy, etc.)

All attributes which will support delivery of additional functionality (workflow, delegation, etc.)

Considering the desired outcome (consistent management and delegated administration of Entities), significant consideration is expected to be placed upon schema design at this stage.

4.3 Off-line Data Normalization

Whereas for PERSON Entity information Normalization delivered significant benefits and was strongly encouraged, for NON-PERSON Entities normalization mostly prompts for a consistent naming convention across the environment. It also triggers the definition and governance of the descriptive attribute information, in order to best achieve:

Data consistency

Data completeness

Data correctness


These requirements have to be accounted for both in the schema definition and in the data quality at different stages within the environment (a sketch of such checks follows this list):

Data acquisition or definition: the point where the attribute value is defined

Data storage: the format in which the value will be stored

Data presentation: the format in which the attribute value will be presented to different groups (e.g. Networks, OS Administration, System Management, Archive and Restore, Facilities, etc.)
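These three qualities can be checked mechanically. The naming pattern (site-type-number) and the required attributes in this sketch are assumptions, to be replaced by the organisation's own conventions.

```python
# Sketch of normalization checks for NON-PERSON device entries:
# consistency (naming convention), completeness (required attributes)
# and correctness (non-empty values). The pattern and attribute set
# are assumptions standing in for the organisation's own conventions.

import re

DEVICE_NAME = re.compile(r"^[A-Z]{3}-(SRV|SW|RTR|HUB)-\d{3}$")  # e.g. MEL-SW-001
REQUIRED_ATTRS = ("owner", "location", "asset_tag")

def check_device(entry: dict) -> list:
    """Return the list of data-quality problems for one device entry."""
    problems = []
    if not DEVICE_NAME.match(entry.get("name", "")):
        problems.append("name does not follow the naming convention")
    for attr in REQUIRED_ATTRS:
        value = entry.get(attr)
        if value is None:
            problems.append(f"missing attribute: {attr}")
        elif not str(value).strip():
            problems.append(f"empty value for attribute: {attr}")
    return problems

assert check_device({"name": "MEL-SW-001", "owner": "Networks",
                     "location": "Glen Waverley DC", "asset_tag": "A-1042"}) == []
```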

4.4 Off-line Data loading and production cutover

For NON-PERSON Entities, the connectors might not have much to do with maintaining the joining of identities, but rather with keeping the information about the different attribute values up to date.

It is important to consider the part of the tree in which the NON-PERSON Entities will be loaded, and the extent to which the management of these identities will be delivered from a directory perspective. Although significant numbers and types of DEVICES are already directory-centric, there might be difficulty in accepting that the SERVICE Entities be managed in a directory-centric way. This decision is left to the customer.

Traditionally, DEVICES directory entries have been placed in trees that resembled the geography of their locality. Unless there are directory replication issues, this has to be re-assessed: most system management applications deployed so far can easily provide a geographic representation of the DEVICES distribution, so other factors may come into play at this stage.

Regarding the SERVICES tree, its design should be approached from the angle of application architecture and integration relevant to the specific organisation.

4.5 In-line Data Joining

A major problem occurs in relation to NON-PERSON entities in-line joining. Whereas for PERSON entities the likelihood that different entities would share the same attributes (Surname, Given Name) was rare relative to the number of entities, for NON-PERSON entities this can be expected to become the rule.

Therefore NON-PERSON entities should be separated into:

Services – applications and functional entities

Devices – network entities, computers, routers, SAN

Therefore, each entity should be tagged with its home repository, because a similarly named entity in another repository will actually be a different entity (e.g. the Admin account in MS-Windows and in NetWare). This will lead to the creation of a central Authentication/Authorization repository, since few entities will be matched across repositories. The likely candidates are the test and training accounts, if the organisation maintains a consistent environment.


4.6 In-line Data Mapping

Due to the situation described above, it is recommended to tag the entities with as many attributes as possible in order to identify their origin.

The categories of attributes used for PERSON Entity Data Mapping are also relevant for NON-PERSON Entities:

All attributes used for joining purposes.

All attributes used for marking purposes.

All attributes which uniquely and unambiguously identify the entity.

All attributes which are authoritative in their source repositories.

Additional attributes which enhance the identity description.

Additional attributes which owners of existing or planned applications will use, and which will consequently be imported into those applications.

Additional attributes which will enhance the Meta-directory available functionality (Authentication, Authorization, Privacy, etc.)

All attributes which will support delivery of additional functionality (workflow, delegation, etc.)

It is desirable to have a consistent schema; therefore an information classification resembling Configuration Management Database schemas can be used. This exercise will help with the naming conventions of both APPLICATION and DEVICE NON-PERSON entities.

4.7 In-line Data Normalization

As with the mapping of the NON-PERSON entities, data normalization can bring significant improvements in the space of APPLICATION and DEVICE Configuration Management. It will prompt consistent naming conventions and the discovery of the attributes available for different entities.

By streamlining the naming convention and building a CMDB with comprehensive sets of attributes, this exercise will deliver benefits in the space of Identity-Driven Networks, asset management, security, operational policies and so on.

4.8 In-line Data loading and production cutover

This is a relatively new approach for NON-PERSON entities. Therefore, as in the space of PERSON entities, it requires a long-term vision supported by constant, controllable deliverables which provide incremental improvements to the environment. As part of the strategy, it will help to prioritize the areas and the sequence of entities to be considered. It may require different directories: an application directory and a devices directory.

Since PERSON entities use application entities as well as devices, a careful, scalable, inter-dependent design should be considered.
