
Upload: ayansane635

Post on 12-Dec-2014

Abdoulaye M Yansane, Abdoulaye Mouke Yansane, Mouke Yansane

Data Modeling 101

www.agiledata.org: Techniques for Successful

Evolutionary/Agile Database Development


The goal of this article is to give an overview of the fundamental data modeling skills that all developers should have, skills that can be applied both on traditional projects that take a serial approach and on agile projects that take an evolutionary approach. My personal philosophy is that every IT professional should have a basic understanding of data modeling. They don’t need to be experts at data modeling, but they should be prepared to be involved in the creation of such a model, be able to read an existing data model, understand when and when not to create a data model, and appreciate fundamental data design techniques. This article is a brief introduction to these skills. The primary audience for this article is application developers who need to gain an understanding of some of the critical activities performed by an Agile DBA. This understanding should lead to an appreciation of what Agile DBAs do and why they do it, and it should help to bridge the communication gap between these two roles.

 

 

Table of Contents

1. What is data modeling?
   o How are data models used in practice?
   o What about conceptual models?
   o Common data modeling notations

2. How to model data
   o Identify entity types
   o Identify attributes
   o Apply naming conventions
   o Identify relationships
   o Apply data model patterns
   o Assign keys
   o Normalize to reduce data redundancy
   o Denormalize to improve performance

3. Evolutionary/agile data modeling

4. How to become better at modeling data

 

1. What is Data Modeling?


Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts, data models can be used for a variety of purposes, from high-level conceptual models to physical data models. From the point of view of an object-oriented developer, data modeling is conceptually similar to class modeling. With data modeling you identify entity types whereas with class modeling you identify classes. Data attributes are assigned to entity types just as you would assign attributes and operations to classes. There are associations between entities, similar to the associations between classes: relationships, inheritance, composition, and aggregation are all applicable concepts in data modeling.

Traditional data modeling is different from class modeling because it focuses solely on data: class models allow you to explore both the behavior and data aspects of your domain, whereas with a data model you can only explore data issues. Because of this focus, data modelers have a tendency to be much better at getting the data “right” than object modelers. However, some people will model database methods (stored procedures, stored functions, and triggers) when they are physical data modeling. It depends on the situation of course, but I personally think that this is a good idea and promote the concept in my UML data modeling profile (more on this later).

Although the focus of this article is data modeling, there are often alternatives to data-oriented artifacts (never forget Agile Modeling’s Multiple Models principle). For example, when it comes to conceptual modeling, ORM diagrams aren’t your only option: in addition to LDMs it is quite common for people to create UML class diagrams and even Class Responsibility Collaborator (CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams because it is very easy to get project stakeholders actively involved in the creation of the model. Instead of a traditional, analyst-led drawing session you can facilitate stakeholders through the creation of CRC cards.

 

1.1 How are Data Models Used in Practice?

Although methodology issues are covered later, we need to discuss how data models can be used in practice to better

understand them. You are likely to see three basic styles of data model:

Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams high-level conceptual models are often created as part of your initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. On traditional teams conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.

Logical data models (LDMs). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).

Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects, and as a result the focus of this article is on physical modeling.


Although LDMs and PDMs sound very similar, and they in fact are, the level of detail that they model can be significantly different. This is because the goals for each diagram are different: you can use an LDM to explore domain concepts with your stakeholders and the PDM to define your database design. Figure 1 presents a simple LDM and Figure 2 a simple PDM, both modeling the concept of customers and addresses as well as the relationship between them. Both diagrams apply the Barker notation, summarized below. Notice how the PDM shows greater detail, including an associative table required to implement the association as well as the keys needed to maintain the relationships. More on these concepts later. PDMs should also reflect your organization’s database naming standards; in this case an abbreviation of the entity name is prefixed to each column name and an abbreviation for “Number” was consistently introduced. A PDM should also indicate the data types for the columns, such as integer and char(5). Although Figure 2 does not show them, lookup tables (also called reference tables or description tables) for how the address is used as well as for states and countries are implied by the attributes ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE.
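To make the physical level of detail concrete, here is a minimal sketch in Python that uses the standard sqlite3 module to create tables of the kind a PDM describes. It is not a transcription of Figure 2. Only TCUSTOMER, CUST_FIRST_NAME, CUST_SURNAME, ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE are named in the text; the other table names, column names, data types, and the placement of ADDR_USAGE_CODE on the associative table are assumptions made for illustration.

```python
import sqlite3

# Hypothetical physical schema. Names not mentioned in the article
# (TADDRESS, TCUSTOMER_ADDRESS, CUST_NO, ADDR_ID, ADDR_STREET) are assumptions.
PDM_DDL = """
CREATE TABLE TCUSTOMER (
    CUST_NO         INTEGER PRIMARY KEY,
    CUST_FIRST_NAME VARCHAR(20) NOT NULL,
    CUST_SURNAME    VARCHAR(20) NOT NULL
);
CREATE TABLE TADDRESS (
    ADDR_ID      INTEGER PRIMARY KEY,
    ADDR_STREET  VARCHAR(40) NOT NULL,
    STATE_CODE   CHAR(2),
    COUNTRY_CODE CHAR(3)
);
-- Associative table that implements the customer/address relationship.
CREATE TABLE TCUSTOMER_ADDRESS (
    CUST_NO         INTEGER NOT NULL REFERENCES TCUSTOMER (CUST_NO),
    ADDR_ID         INTEGER NOT NULL REFERENCES TADDRESS (ADDR_ID),
    ADDR_USAGE_CODE CHAR(1) NOT NULL,
    PRIMARY KEY (CUST_NO, ADDR_ID)
);
"""


def build_pdm() -> sqlite3.Connection:
    """Create the sketch schema in an in-memory database."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(PDM_DDL)
    return connection


conn = build_pdm()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The logical model would show only Customer, Address, and the relationship between them; the associative table, the key columns, and the data types appear only at the physical level.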

 

Figure 1. A simple logical data model.

 

Figure 2. A simple physical data model.

An important observation about Figures 1 and 2 is that I’m not slavishly following Barker’s approach to naming relationships. For example, between Customer and Address there really should be two names: “Each CUSTOMER may be located in one or more ADDRESSES” and “Each ADDRESS may be the site of one or more CUSTOMERS”. Although these names explicitly define the relationship, I personally think that they’re visual noise that clutters the diagram. I prefer simple names such as “has” and then trust my readers to interpret the name in each direction. I’ll only add more information where it’s needed, and in this case I think that it isn’t. However, a significant advantage of describing the names the way that Barker suggests is that it’s a good test to see if you actually understand the relationship: if you can’t name it then you likely don’t understand it.

Data models can be used effectively at both the enterprise level and on projects. Enterprise architects will often

create one or more high-level LDMs that depict the data structures that support your enterprise, models typically

referred to as enterprise data models or enterprise information models. An enterprise data model is one of several

views that your organization’s enterprise architects may choose to maintain and support – other views may explore

your network/hardware infrastructure, your organization structure, your software infrastructure, and your business

processes (to name a few). Enterprise data models provide information that a project team can use both as a set of

constraints as well as important insights into the structure of their system.

Project teams will typically create LDMs as a primary analysis artifact when their implementation environment is

predominantly procedural in nature, for example they are using structured COBOL as an implementation language.

LDMs are also a good choice when a project is data-oriented in nature, perhaps a data warehouse or reporting

system is being developed (having said that, experience seems to show that usage-centered approaches appear to

work even better). However LDMs are often a poor choice when a project team is using object-oriented or

component-based technologies because the developers would rather work with UML diagrams or when the project is

not data-oriented in nature. As Agile Modeling advises, apply the right artifact(s) for the job. Or, as your

grandfather likely advised you, use the right tool for the job.  It's important to note that traditional approaches to

Master Data Management (MDM) will often motivate the creation and maintenance of detailed LDMs, an effort

that is rarely justifiable in practice when you consider the total cost of ownership (TCO) when calculating the return

on investment (ROI) of those sorts of efforts.

When a relational database is used for data storage, project teams are best advised to create a PDM to model its internal schema. My experience is that a PDM is often one of the critical design artifacts for business application development projects.

 

1.2 What About Conceptual Models?

Halpin (2001) points out that many data professionals prefer to create an Object-Role Model (ORM), an example of which is depicted in Figure 3, instead of an LDM for a conceptual model. The advantage is that the notation is very simple, something your project stakeholders can quickly grasp, although the disadvantage is that the models become large very quickly. ORMs enable you to first explore actual data examples instead of simply jumping to a potentially incorrect abstraction; for example, Figure 3 examines the relationship between customers and addresses in detail. For more information about ORM, visit www.orm.net.

 

Figure 3. A simple Object-Role Model.


My experience is that people will capture information in the best place that they know. As a result I typically discard ORMs after I’m finished with them. I sometimes use ORMs to explore the domain with project stakeholders but later replace them with a more traditional artifact such as an LDM, a class diagram, or even a PDM. As a generalizing specialist, someone with one or more specialties who also strives to gain general skills and knowledge, this is an easy decision for me to make; I know that the information that I’ve just “discarded” will be captured in another artifact (a model, the tests, or even the code) that I understand. A specialist who only understands a limited number of artifacts and therefore hands off their work to other specialists doesn’t have this as an option. Not only are they tempted to keep the artifacts that they create but also to invest even more time to enhance the artifacts. Generalizing specialists are more likely than specialists to travel light.

 

1.3 Common Data Modeling Notations

Figure 4 presents a summary of the syntax of four common data modeling notations: Information Engineering (IE), Barker, IDEF1X, and the Unified Modeling Language (UML). This diagram isn’t meant to be comprehensive; instead its goal is to provide a basic overview. Furthermore, for the sake of brevity I wasn’t able to depict the highly-detailed approach to relationship naming that Barker suggests. Although I provide a brief description of each notation in Table 1, I highly recommend David Hay’s paper A Comparison of Data Modeling Techniques as he goes into greater detail than I do.

 

Figure 4. Comparing the syntax of common data modeling notations.


 

Table 1. Discussing common data modeling notations.

Notation | Comments

IE | The IE notation (Finkelstein 1989) is simple and easy to read, and is well suited for high-level logical and enterprise data modeling. The only drawback of this notation, arguably an advantage, is that it does not support the identification of attributes of an entity. The assumption is that the attributes will be modeled with another diagram or simply described in the supporting documentation.

Barker | The Barker notation is one of the more popular ones; it is supported by Oracle’s toolset and is well suited for all types of data models. Its approach to subtyping can become clunky with hierarchies that go several levels deep.

IDEF1X | This notation is overly complex. It was originally intended for physical modeling but has been misapplied for logical modeling as well. Although popular within some U.S. government agencies, particularly the Department of Defense (DoD), this notation has been all but abandoned by everyone else. Avoid it if you can.

UML | This is not an official data modeling notation (yet). Although several suggestions for a data modeling profile for the UML exist, none are complete and, more importantly, none are “official” UML yet. However, the Object Management Group (OMG) in December 2005 announced an RFP for data-oriented models.

 

3. How to Model Data

It is critical for an application developer to have a grasp of the fundamentals of data modeling so they can not only read data models but also work effectively with Agile DBAs who are responsible for the data-oriented aspects of your project. Your goal in reading this section is not to learn how to become a data modeler; instead it is simply to gain an appreciation of what is involved.

The following tasks are performed in an iterative manner:

Identify entity types

Identify attributes

Apply naming conventions

Identify relationships

Apply data model patterns

Assign keys

Normalize to reduce data redundancy

Denormalize to improve performance

  Very good practical books about data modeling include Joe

Celko’s Data & Databases and Data Modeling for

Information Professionals as they both focus on practical

issues with data modeling. The Data Modeling Handbook


and Data Model Patterns are both excellent resources once

you’ve mastered the fundamentals. An Introduction to

Database Systems is a good academic treatise for anyone

wishing to become a data specialist.

 

3.1 Identify Entity Types

An entity type, also simply called entity (not exactly accurate terminology, but very common in practice), is similar

conceptually to object-orientation’s concept of a class – an entity type represents a collection of similar objects. An

entity type could represent a collection of people, places, things, events, or concepts. Examples of entities in an

order entry system would include Customer, Address, Order, Item, and Tax. If you were class modeling you would

expect to discover classes with the exact same names. However, the difference between a class and an entity type is

that classes have both data and behavior whereas entity types just have data.

Ideally an entity should be normal, the data modeling world’s version of cohesive. A normal entity depicts one

concept, just like a cohesive class models one concept. For example, customer and order are clearly two different

concepts; therefore it makes sense to model them as separate entities.

 

3.2 Identify Attributes

Each entity type will have one or more data attributes. For example, in Figure 1 you saw that the Customer entity

has attributes such as First Name and Surname and in Figure 2 that the TCUSTOMER table had corresponding data

columns CUST_FIRST_NAME and CUST_SURNAME (a column is the implementation of a data attribute within a

relational database).

Attributes should also be cohesive from the point of view of your domain, something that is often a judgment call. In Figure 1 we decided that we wanted to model the fact that people had both first and last names instead of just a name (e.g. “Scott” and “Ambler” vs. “Scott Ambler”), whereas we did not distinguish between the sections of an American zip code (e.g. 90210-1234-5678). Getting the level of detail right can have a significant impact on your development and maintenance efforts. Refactoring a single data column into several columns can be difficult (database refactoring is described in detail in Database Refactoring), whereas over-specifying an attribute (e.g. having three attributes for zip code when you only needed one) can result in overbuilding your system, and hence you incur greater development and maintenance costs than you actually needed.

 

3.3 Apply Data Naming Conventions

Your organization should have standards and guidelines applicable to data modeling, something you should be able to obtain from your enterprise administrators (if they don’t exist you should lobby to have some put in place). These guidelines should include naming conventions for both logical and physical modeling: the logical naming conventions should be focused on human readability whereas the physical naming conventions will reflect technical considerations. You can clearly see that different naming conventions were applied in Figures 1 and 2.

As you saw in Introduction to Agile Modeling, AM includes the Apply Modeling Standards practice. The basic idea is that developers should agree to and follow a common set of modeling standards on a software project. Just as there is value in following common coding conventions (clean code that follows your chosen coding guidelines is easier to understand and evolve than code that doesn't), there is similar value in following common modeling conventions.

 

3.4 Identify Relationships

In the real world entities have relationships with other entities. For example, customers PLACE orders, customers

LIVE AT addresses, and line items ARE PART OF orders. Place, live at, and are part of are all terms that define

relationships between entities. The relationships between entities are conceptually identical to the relationships

(associations) between objects.

Figure 5 depicts a partial LDM for an online ordering system. The first thing to notice is the various styles applied to relationship names and roles: different relationships require different approaches. For example, the relationship between Customer and Order has two names, places and is placed by, whereas the relationship between Customer and Address has one. In this example having a second name on the relationship, the idea being that you want to specify how to read the relationship in each direction, is redundant; you’re better off finding a clear wording for a single relationship name, decreasing the clutter on your diagram. Similarly, you will often find that specifying the roles that an entity plays in a relationship negates the need to give the relationship a name (although some CASE tools may inadvertently force you to do this). For example, the role of billing address and the label billed to are clearly redundant; you really only need one. Likewise, the role part of that Line Item has in its relationship with Order is sufficiently obvious without a relationship name.

Figure 5. A logical data model (Information Engineering notation).

 


You also need to identify the cardinality and optionality of a relationship (the UML combines the concepts of optionality and cardinality into the single concept of multiplicity). Cardinality represents the concept of “how many” whereas optionality represents the concept of “whether you must have something.” For example, it is not enough to know that customers place orders. How many orders can a customer place? None, one, or several? Furthermore, relationships are two-way streets: not only do customers place orders, but orders are placed by customers. This leads to questions like: how many customers can be involved in any given order, and is it possible to have an order with no customer involved? Figure 5 shows that customers place one or more orders and that any given order is placed by one customer and one customer only. It also shows that a customer lives at one or more addresses and that any given address has zero or more customers living at it.
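As a rough illustration of how cardinality and optionality surface in a physical schema, the following Python sketch (standard sqlite3 module) implements the "any given order is placed by one customer and one customer only" rule. The table and column names are assumptions, not taken from Figure 5. A single foreign key column gives "at most one customer per order", and NOT NULL makes that customer mandatory. The other direction ("customers place one or more orders") is not enforced here; a minimum of one order per customer usually needs application logic or triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign keys unenforced unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Customer (
    CustomerNumber INTEGER PRIMARY KEY,
    Surname        TEXT NOT NULL
);
CREATE TABLE CustomerOrder (
    OrderID        INTEGER PRIMARY KEY,
    -- One column: at most one customer. NOT NULL: the customer is mandatory.
    CustomerNumber INTEGER NOT NULL REFERENCES Customer (CustomerNumber)
);
INSERT INTO Customer VALUES (1, 'Ambler');
INSERT INTO CustomerOrder VALUES (100, 1);
INSERT INTO CustomerOrder VALUES (101, 1);  -- one customer, several orders
""")


def try_insert(sql: str) -> bool:
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Optionality: an order without a customer is rejected by NOT NULL.
order_without_customer = try_insert("INSERT INTO CustomerOrder VALUES (102, NULL)")
# Referential integrity: an order for a customer that does not exist is rejected.
order_for_unknown_customer = try_insert("INSERT INTO CustomerOrder VALUES (103, 99)")

orders_for_customer_1 = conn.execute(
    "SELECT COUNT(*) FROM CustomerOrder WHERE CustomerNumber = 1").fetchone()[0]
print(order_without_customer, order_for_unknown_customer, orders_for_customer_1)
```

Dropping the NOT NULL constraint would turn the mandatory relationship into an optional one, which is exactly the kind of decision the optionality question is meant to settle.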

Although the UML distinguishes between different types of relationships (associations, inheritance, aggregation, composition, and dependency), data modelers often aren’t as concerned with this issue as object modelers are. Subtyping, one application of inheritance, is often found in data models, an example of which is the is a relationship between Item and its two “sub entities” Service and Product. Aggregation and composition are much less common and typically must be implied from the data model, as you see with the part of role that Line Item takes with Order. UML dependencies are typically a software construct and therefore wouldn’t appear on a data model, unless of course it was a very highly detailed physical model that showed how views, triggers, or stored procedures depended on other aspects of the database schema.

 

3.5 Apply Data Model Patterns

Some data modelers will apply common data model patterns (David Hay’s book Data Model Patterns is the best reference on the subject), just as object-oriented developers will apply analysis patterns (Fowler 1997; Ambler 1997) and design patterns (Gamma et al. 1995). Data model patterns are conceptually closest to analysis patterns because they describe solutions to common domain issues. Hay’s book is a very good reference for anyone involved in analysis-level modeling, even when you’re taking an object approach instead of a data approach, because his patterns model business structures from a wide variety of business domains.

 

3.6 Assign Keys

There are two fundamental strategies for assigning keys to tables. First, you could assign a natural key, which is one or more existing data attributes that are unique to the business concept. In the Customer table of Figure 6 there are two candidate keys, in this case CustomerNumber and SocialSecurityNumber. Second, you could introduce a new column, called a surrogate key, which is a key that has no business meaning. An example is the AddressID column of the Address table in Figure 6. Addresses don’t have an “easy” natural key because you would need to use all of the columns of the Address table to form a key for itself (you might be able to get away with just the combination of Street and ZipCode depending on your problem domain); therefore introducing a surrogate key is a much better option in this case.

 

Figure 6. Customer and Address revisited (UML notation).


 

Let's consider Figure 6 in more detail. Figure 6 presents an alternative design to that presented in Figure 2: a different naming convention was adopted and the model itself is more extensive. In Figure 6 the Customer table has the CustomerNumber column as its primary key and SocialSecurityNumber as an alternate key. This indicates that the preferred way to access customer information is through the value of a person’s customer number, although your software can get at the same information if it has the person’s social security number. The CustomerHasAddress table has a composite primary key, the combination of CustomerNumber and AddressID. A foreign key is one or more attributes in an entity type that represents a key, either primary or secondary, in another entity type. Foreign keys are used to maintain relationships between rows. For example, the relationship between rows in the CustomerHasAddress table and the Customer table is maintained by the CustomerNumber column within the CustomerHasAddress table. The interesting thing about the CustomerNumber column is the fact that it is part of the primary key for CustomerHasAddress as well as a foreign key to the Customer table. Similarly, the AddressID column is part of the primary key of CustomerHasAddress as well as a foreign key to the Address table to maintain the relationship with rows of Address.
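The key concepts described above can be sketched with Python's standard sqlite3 module. The table names and the CustomerNumber, SocialSecurityNumber, AddressID, Street, and ZipCode columns follow the text; the Surname column, the data types, the sample values, and the two helper functions are assumptions added for illustration, and the sketch covers only a small part of what Figure 6 presumably shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.executescript("""
CREATE TABLE Customer (
    CustomerNumber       INTEGER PRIMARY KEY,   -- natural key used as primary key
    SocialSecurityNumber TEXT NOT NULL UNIQUE,  -- alternate key
    Surname              TEXT NOT NULL          -- hypothetical column
);
CREATE TABLE Address (
    AddressID INTEGER PRIMARY KEY,              -- surrogate key, no business meaning
    Street    TEXT NOT NULL,
    ZipCode   TEXT NOT NULL
);
CREATE TABLE CustomerHasAddress (
    -- Each column is part of the composite primary key and also a foreign key.
    CustomerNumber INTEGER NOT NULL REFERENCES Customer (CustomerNumber),
    AddressID      INTEGER NOT NULL REFERENCES Address (AddressID),
    PRIMARY KEY (CustomerNumber, AddressID)
);
""")

conn.execute("INSERT INTO Customer VALUES (1701, '000-00-0001', 'Ambler')")
# The database generates the surrogate key value for us.
address_id = conn.execute(
    "INSERT INTO Address (Street, ZipCode) VALUES ('1 Main Street', '90210')"
).lastrowid
conn.execute("INSERT INTO CustomerHasAddress VALUES (1701, ?)", (address_id,))


def find_customer_by_ssn(ssn: str):
    """Look a customer up through the alternate key instead of the primary key."""
    return conn.execute(
        "SELECT CustomerNumber, Surname FROM Customer WHERE SocialSecurityNumber = ?",
        (ssn,),
    ).fetchone()


def streets_of(customer_number: int) -> list:
    """Follow the foreign keys from Customer through CustomerHasAddress to Address."""
    rows = conn.execute(
        """SELECT a.Street
           FROM CustomerHasAddress cha
           JOIN Address a ON a.AddressID = cha.AddressID
           WHERE cha.CustomerNumber = ?""",
        (customer_number,),
    ).fetchall()
    return [row[0] for row in rows]


print(find_customer_by_ssn("000-00-0001"), streets_of(1701))
```

The UNIQUE constraint is what makes SocialSecurityNumber usable as an alternate key: the database guarantees that a lookup by that column returns at most one customer.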

Although the "natural vs. surrogate" debate is one of the great religious issues within the data community, the fact is

that neither strategy is perfect and you'll discover that in practice (as we see in Figure 6) sometimes it makes sense

to use natural keys and sometimes it makes sense to use surrogate keys. In Choosing a Primary Key: Natural or

Surrogate? I describe the relevant issues in detail.

 

3.7 Normalize to Reduce Data Redundancy


Data normalization is a process in which data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. Table 2 summarizes the three most common normalization rules describing how to put entity types into a series of increasing levels of normalization. Higher levels of data normalization (Date 2000) are beyond the scope of this article. With respect to terminology, a data schema is considered to be at the level of normalization of its least normalized entity type. For example, if all of your entity types are at second normal form (2NF) or higher then we say that your data schema is at 2NF.

Table 2. Data Normalization Rules.

Level | Rule

First normal form (1NF) | An entity type is in 1NF when it contains no repeating groups of data.

Second normal form (2NF) | An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.

Third normal form (3NF) | An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.

 

Figure 7 depicts a database schema in 0NF whereas Figure 8 depicts a normalized schema in 3NF. Read the Introduction to Data Normalization essay for details.
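As a small, hedged illustration of the first rule in Table 2, the Python sketch below removes a repeating group from a flat order record. The field names (OrderID, DateOrdered, ItemNumber1, NumberOrdered1, and so on) are assumptions in the spirit of a 0NF order table with numbered item slots; they are not taken from Figure 7, and the function only covers the step to 1NF, not 2NF or 3NF.

```python
from typing import Any


def split_repeating_groups(flat_order: dict[str, Any], max_items: int = 9):
    """Split a 0NF order record into one order row plus one row per order item."""
    order = {
        "OrderID": flat_order["OrderID"],
        "DateOrdered": flat_order["DateOrdered"],
    }
    items = []
    for n in range(1, max_items + 1):
        item_number = flat_order.get(f"ItemNumber{n}")
        if item_number is None:
            continue  # unused slot in the repeating group
        items.append({
            "OrderID": flat_order["OrderID"],  # foreign key back to the order row
            "ItemSequence": n,
            "ItemNumber": item_number,
            "NumberOrdered": flat_order[f"NumberOrdered{n}"],
        })
    return order, items


# Hypothetical flat record with two of the nine item slots filled in.
flat = {
    "OrderID": 1,
    "DateOrdered": "2024-01-15",
    "ItemNumber1": "A-100", "NumberOrdered1": 2,
    "ItemNumber2": "B-200", "NumberOrdered2": 1,
}
order, items = split_repeating_groups(flat)
print(order)
print(items)
```

After the split, an order can hold any number of items rather than a fixed maximum, and each item row is identified by the combination of OrderID and ItemSequence.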

Why data normalization? The advantage of having a highly normalized data schema is that information is stored in one place and one place only, reducing the possibility of inconsistent data. Furthermore, highly-normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes result in similar solutions (at least from a data point of view). This generally makes it easier to map your objects to your data schema. Unfortunately, normalization usually comes at a performance cost. With the data schema of Figure 7 all the data for a single order is stored in one row (assuming orders of up to nine order items), making it very easy to access. With the data schema of Figure 7 you could quickly determine the total amount of an order by reading the single row from the Order0NF table. To do so with the data schema of Figure 8 you would need to read data from a row in the Order table, data from all the rows from the OrderItem table for that order, and data from the corresponding rows in the Item table for each order item. For this query, the data schema of Figure 7 very likely provides better performance.
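The multi-table read described above can be sketched as follows, using Python's standard sqlite3 module. The Order, OrderItem, and Item table names come from the text; the columns, the sample data, and the choice to store prices in cents are assumptions, and the schema is far smaller than the one in Figure 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    ItemNumber TEXT PRIMARY KEY,
    Price      INTEGER NOT NULL   -- price in cents (assumption)
);
-- "Order" is quoted because ORDER is a reserved word in SQL.
CREATE TABLE "Order" (
    OrderID     INTEGER PRIMARY KEY,
    DateOrdered TEXT NOT NULL
);
CREATE TABLE OrderItem (
    OrderID       INTEGER NOT NULL REFERENCES "Order" (OrderID),
    ItemSequence  INTEGER NOT NULL,
    ItemNumber    TEXT NOT NULL REFERENCES Item (ItemNumber),
    NumberOrdered INTEGER NOT NULL,
    PRIMARY KEY (OrderID, ItemSequence)
);
INSERT INTO Item VALUES ('A-100', 1250), ('B-200', 400);
INSERT INTO "Order" VALUES (1, '2024-01-15');
INSERT INTO OrderItem VALUES (1, 1, 'A-100', 2), (1, 2, 'B-200', 1);
""")


def order_total_cents(order_id: int) -> int:
    """Total an order by joining the order row, its item rows, and the item prices."""
    row = conn.execute(
        """SELECT COALESCE(SUM(oi.NumberOrdered * i.Price), 0)
           FROM "Order" o
           JOIN OrderItem oi ON oi.OrderID = o.OrderID
           JOIN Item i ON i.ItemNumber = oi.ItemNumber
           WHERE o.OrderID = ?""",
        (order_id,),
    ).fetchone()
    return row[0]


print(order_total_cents(1))  # 2 * 1250 + 1 * 400 = 2900
```

The query touches three tables to answer a question that the single-row design answers from one row, which is the performance trade-off the paragraph describes.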

 

Figure 7. An Initial Data Schema for Order (UML Notation).


 

Figure 8. A normalized schema in 3NF (UML Notation).


In class modeling, there is a similar concept called Class Normalization although that is beyond the scope of this

article.


 

3.8 Denormalize to Improve Performance

Normalized data schemas, when put into production, often suffer from performance problems. This makes sense: the rules of data normalization focus on reducing data redundancy, not on improving performance of data access. An important part of data modeling is to denormalize portions of your data schema to improve database access times. For example, the data model of Figure 9 looks nothing like the normalized schema of Figure 8. To understand why the differences between the schemas exist you must consider the performance needs of the application. The primary goal of this system is to process new orders from online customers as quickly as possible. To do this customers need to be able to search for items and add them to their order quickly, remove items from their order if need be, then have their final order totaled and recorded quickly. The secondary goal of the system is to process, ship, and bill the orders afterwards.

 

Figure 9. A Denormalized Order Data Schema (UML notation).


 

To denormalize the data schema the following decisions were made:

1. To support quick searching of item information, the Item table was left alone.

2. To support the addition and removal of order items, the concept of an OrderItem table was kept, albeit split in two to support outstanding orders and fulfilled orders. New order items can easily be inserted into the OutstandingOrderItem table, or removed from it, as needed.

3. To support order processing, the Order and OrderItem tables were reworked into pairs to handle outstanding and fulfilled orders respectively. Basic order information is first stored in the OutstandingOrder and OutstandingOrderItem tables; once the order has been shipped and paid for, the data is copied into the FulfilledOrder and FulfilledOrderItem tables and removed from the outstanding tables. Data access time to the two outstanding-order tables is reduced because only active orders are stored there. On average an order may be outstanding for a couple of days, whereas for financial reporting reasons it may be stored in the fulfilled order tables for several years until archived. There is a performance penalty under this scheme because of the need to delete outstanding orders and then resave them as fulfilled orders, clearly something that would need to be processed as a transaction.

4. The contact information for the person(s) the order is being shipped and billed to was also denormalized back into the Order table, reducing the time it takes to write an order to the database because there is now one write instead of two or three. The retrieval and deletion times for that data would be similarly improved.
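
The move from the outstanding tables to the fulfilled tables described in decision 3 can be sketched as a single transaction. This is only an illustration using Python's built-in sqlite3 module: the column names (order_id, ship_to_name, item_id, quantity) and the function names are assumptions made for the example, and the columns actually shown in Figure 9 are not reproduced here.

```python
import sqlite3

def create_order_tables(conn):
    # Hypothetical, heavily simplified columns (not the columns of Figure 9).
    conn.executescript("""
        CREATE TABLE OutstandingOrder (order_id INTEGER PRIMARY KEY, ship_to_name TEXT);
        CREATE TABLE OutstandingOrderItem (order_id INTEGER, item_id INTEGER, quantity INTEGER);
        CREATE TABLE FulfilledOrder (order_id INTEGER PRIMARY KEY, ship_to_name TEXT);
        CREATE TABLE FulfilledOrderItem (order_id INTEGER, item_id INTEGER, quantity INTEGER);
    """)

def fulfil_order(conn, order_id):
    # Using the connection as a context manager commits on success and
    # rolls back if any statement raises, so the move is all-or-nothing.
    with conn:
        conn.execute(
            "INSERT INTO FulfilledOrder SELECT * FROM OutstandingOrder WHERE order_id = ?",
            (order_id,))
        conn.execute(
            "INSERT INTO FulfilledOrderItem SELECT * FROM OutstandingOrderItem WHERE order_id = ?",
            (order_id,))
        conn.execute("DELETE FROM OutstandingOrderItem WHERE order_id = ?", (order_id,))
        conn.execute("DELETE FROM OutstandingOrder WHERE order_id = ?", (order_id,))

conn = sqlite3.connect(":memory:")
create_order_tables(conn)
with conn:
    conn.execute("INSERT INTO OutstandingOrder VALUES (1, 'A. Customer')")
    conn.execute("INSERT INTO OutstandingOrderItem VALUES (1, 42, 3)")
fulfil_order(conn, 1)
```

The two inserts and two deletes are the "performance penalty" mentioned above; wrapping them in one transaction means a failure part-way through cannot leave an order in both sets of tables or in neither.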

Note that if your initial, normalized data design meets the performance needs of your application then it is fine as is.

Denormalization should be resorted to only when performance testing shows that you have a problem with your

objects and subsequent profiling reveals that you need to improve database access time. As my grandfather said, if it

ain’t broke don’t fix it.

 

5. Evolutionary/Agile Data Modeling

Evolutionary data modeling is data modeling performed in an iterative and incremental manner.  The article

Evolutionary Development explores evolutionary software development in greater detail.  Agile data modeling is

evolutionary data modeling done in a collaborative manner.   The article Agile Data Modeling: From Domain

Modeling to Physical Modeling works through a case study which shows how to take an agile approach to data

modeling.

Although you wouldn’t think it, data modeling can be one of the most challenging tasks that an Agile DBA can be

involved with on an agile software development project.  Your approach to data modeling will often be at the center

of any controversy between the agile software developers and the traditional data professionals within your

organization.  Agile software developers will lean towards an evolutionary approach where data modeling is just one

of many activities whereas traditional data professionals will often lean towards a big design up front (BDUF)

approach where data models are the primary artifacts, if not THE artifacts.  This problem results from a combination

of the cultural impedance mismatch, a misguided need to enforce the "one truth", and “normal” political

maneuvering within your organization.  As a result Agile DBAs often find that navigating the political waters is an

important part of their data modeling efforts.

 

6. How to Become Better At Modeling Data

How do you improve your data modeling skills? Practice, practice, practice. Whenever you get a chance you should

work closely with Agile DBAs, volunteer to model data with them, and ask them questions as the work progresses.

Agile DBAs will be following the AM practice Model With Others so should welcome the assistance as well as the

questions – one of the best ways to really learn your craft is to have someone ask “why are you doing it that way?”

You should be able to learn physical data modeling skills from Agile DBAs, and often logical data modeling skills as

well.


Similarly you should take the opportunity to work with the enterprise architects within your organization. As you

saw in Agile Enterprise Architecture they should be taking an active role on your project, mentoring your project

team in the enterprise architecture (if any), mentoring you in modeling and architectural skills, and aiding in your

team’s modeling and development efforts. Once again, volunteer to work with them and ask questions when you are

doing so. Enterprise architects will be able to teach you conceptual and logical data modeling skills as well as instill

an appreciation for enterprise issues.

You also need to do some reading. Although this article is a good start it is only a brief introduction. The best

approach is to simply ask the Agile DBAs that you work with what they think you should read.

My final word of advice is that it is critical for application developers to understand and appreciate the fundamentals

of data modeling. This is a valuable skill to have and has been since the 1970s. It also provides a common

framework within which you can work with Agile DBAs, and may even prove to be the initial skill that enables you

to make a career transition into becoming a full-fledged Agile DBA.

 

7. References and Suggested Online Readings

Agile/Evolutionary Data Modeling

Agile Database Best Practices

Agile Master Data Management (MDM)

Agile Modeling Best Practices

Choosing a Primary Key: Natural or Surrogate?

Comparing the Various Approaches to Modeling in Software Development

Data & Databases

Data Model Patterns

Data Modeling for Information Professionals

The Data Modeling Handbook

Database Modeling Within an XP Methodology (Ronald Bradford)

Initial High-Level Architectural Envisioning

Initial High-Level Requirements Envisioning

Introduction to Data Normalization

Logical Data Modeling: What It Is and How To Do It by Alan Chmura and J. Mark Heumann

On Relational Theory

The "One Truth Above All Else" Anti-Pattern

Prioritized Requirements: An Agile Best Practice

Survey Results (Agile and Data Management)

When is Enough Modeling Enough?


Data Modeling Techniques, Rules, and Diagram Conventions

Section 4 of the On-line Course:

Learning the Cadastral Data Content Standard

Technical Sections

Sections 4 through 8 are the sections of the Cadastral Data Content Standard educational course which

present detailed technical concepts about data models, crosswalks, translations, and maintenance of the

Standard.

Section 4 describes the entity relationship diagram and the definitions and relationships used in the Cadastral Data

Content Standard, clarifying the data modeling conventions used in the Standard's logical model. Please note that

data modeling is a precise and detailed discipline, often requiring a good bit of effort to gain a working knowledge. If

you are new to data modeling, keep in mind that the information presented here in Section 4 may require some extra

time and patience to understand.

Topics in Section 4:

Overview of the Model

Logical Models vs Physical Models

The Content Standard versus a 'Physical' Standard

Links and References to Information on Data Modeling

Overview of the Cadastral Data Content Standard Model

The Cadastral Data Content Standard model is an illustration of the objects in the Standard. The model is known as a

logical model, and is illustrated in an entity relationship diagram (or E-R diagram).

The logical model describes the definitions or semantics of the cadastral information referred to in the Standard. An

entity relationship diagram is a shorthand method for showing the associations among various objects in the

model, and the relationships between the objects.

The entity relationship diagram illustrates the model's objects, such as the entities, attributes, and the associations

(see *Note below).

A logical data model is not an implementation model. Implementation requires modifying the logical data model to

best fit operating software. This process, called denormalization, is the process of combining entities into tables in a

database that optimize the database operation.

See the diagram conventions discussion for more information about the E-R diagram used in the Cadastral Data

Content Standard.


(* Note: The term "association" is used throughout the Cadastral Data Content Standard to refer to descriptions of

how data entities are related to each other. Some people may be more accustomed to using the term "relationship",

and may wish to substitute that term for "association" while investigating the sections which describe the data

model.)

Logical Models vs Physical Models

The following is a description of the differences between logical models and physical models. Data modeling

professionals often note that there are varying ways of dealing with such details as keys, relationships, and

normalization. Accordingly, the description below has been kept as general as possible.

Logical models depict the true relationships of attributes as they are grouped into entities, relating attributes to

attributes and entities to entities. Logical models are not concerned with implementation, storage mechanisms, and

redundancy of data. Logical models are usually normalized. Normalized means, roughly, that every non-key attribute depends only on the key of its entity, and not on any other non-key attribute.
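
One way to make the idea of attributes not depending on each other concrete is to test sample rows for a functional dependency. The sketch below is an illustration only: the function name and the sample attributes (owner, zip_code, city) are invented for the example, and sample data can only reveal or refute a dependency in those rows, not prove a rule about the domain.

```python
def functionally_determines(rows, determinant, dependent):
    """Return True if, in these rows, each value of `determinant`
    is always paired with the same value of `dependent`."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and
        # returns the stored value on later rows.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample rows for illustration.
sample_rows = [
    {"owner": "Ann", "zip_code": "10001", "city": "New York"},
    {"owner": "Bob", "zip_code": "10001", "city": "New York"},
    {"owner": "Cy",  "zip_code": "60601", "city": "Chicago"},
]

city_depends_on_zip = functionally_determines(sample_rows, "zip_code", "city")
owner_depends_on_city = functionally_determines(sample_rows, "city", "owner")
```

In these rows city is determined by zip_code, which is the kind of dependency between non-key attributes that a normalized model would move into a separate entity.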

Physical models are concerned with the implementation of logical models, and are designed to account for data

storage, indexes, how to retrieve data, and how keys are concatenated. Physical models strive to optimize logical

models according to how data are going to be used, such as for reports, data entry, and analysis. Physical models take

into account the software that will be used, as well as whether the database will be relational, hierarchical or

network.

Entities do not have to be the same between the logical model and the physical model. That is, in order to

accommodate efficient use of data, a physical model may have a greater or fewer number of entities than a logical

model. The physical model assigns lengths to attribute fields. A physical model is usually de-normalized, that is,

attributes may be assigned values and dependencies with other attributes to support using the data. For example, an attribute can be derived from one or more other attributes. If that attribute is used daily for reporting purposes, the derived attribute is stored in the database to avoid daily recalculation.
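
As a minimal sketch of storing a derived attribute, the example below computes a value once at write time and keeps it in the table so that reports can read it directly. The parcel table, its columns, and the area calculation are hypothetical and are not taken from the Standard.

```python
import sqlite3

def create_parcel_table(conn):
    # area_m2 is the stored, derived attribute (hypothetical schema).
    conn.execute(
        "CREATE TABLE parcel ("
        "parcel_id INTEGER PRIMARY KEY, width_m REAL, depth_m REAL, area_m2 REAL)")

def save_parcel(conn, parcel_id, width_m, depth_m):
    # Derive the value once, when the row is written, instead of at report time.
    area_m2 = width_m * depth_m
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO parcel VALUES (?, ?, ?, ?)",
            (parcel_id, width_m, depth_m, area_m2))

def daily_report(conn):
    # The report reads the stored value; nothing is recalculated here.
    return conn.execute(
        "SELECT parcel_id, area_m2 FROM parcel ORDER BY parcel_id").fetchall()

conn = sqlite3.connect(":memory:")
create_parcel_table(conn)
save_parcel(conn, 1, 20.0, 35.0)
report_rows = daily_report(conn)
```

The trade-off is the usual one for denormalization: every write path that changes width_m or depth_m must also update area_m2, or the stored value drifts out of date.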

The Content Standard versus a 'Physical' Standard

The Cadastral Data Content Standard is just that, a content standard. The Standard defines the kinds of entities,

attributes, range of values, and logical relationships which can go into a cadastral database. The Standard does not

define the actual structure of a database, and deals with none of the field definitions or software coding components

of a physical design.

For example, the cadastral standard provides a unique nation-wide identification of principal meridians. The names

of the principal meridians have been standardized and are listed in the Standard document. In a physical format for a

county or state that uses one of the principal meridians, it does not make sense to repeat that value for every record in

the county or state. In this case the physical format does not include the principal meridian as defined in the Standard

in the database. The value for the principal meridian can be generated and added upon data transfer or exchange.
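
The principal meridian example can be sketched as follows: the stored records omit the constant value, and it is added only when records are prepared for exchange. The record layout, the field name principal_meridian, and the placeholder value are assumptions for illustration and are not defined by the Standard.

```python
def to_exchange_records(stored_records, principal_meridian):
    """Return copies of the stored records with the county-wide
    principal meridian added for transfer or exchange."""
    return [
        {**record, "principal_meridian": principal_meridian}
        for record in stored_records
    ]

# Hypothetical local records: the meridian is not repeated on every row.
stored_records = [
    {"parcel_id": 101, "township": "T2N"},
    {"parcel_id": 102, "township": "T3N"},
]

exchange_records = to_exchange_records(stored_records, "Example Principal Meridian")
```

The stored records are left unchanged, so the physical format stays compact while the exchanged data still carries the value the Standard defines.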

In another example, an organization may decide they want their physical database to combine bearings and distances

and their units of measure in the same file as the record boundary. This might be done to accommodate a


computational package, to increase the ease of review of values, or to enhance search performance. The Cadastral

Data Content Standard does not provide for this kind of physical database design and use.

The physical structure of cadastral databases will be dealt with by the Cadastral Data Transfer Profile, which is

currently in development, and is described in Section 6.

Links and References to Information on Data Modeling

For more information on understanding data models, begin with the web sites for:

Applied Information Science

There is a commercial data modeling product from agpw, inc., known as Data Master. Though we have not

reviewed it and cannot endorse the product, you may find it to be worth investigating.

Published information on modeling includes:

Bruce, T.A., Designing Quality Databases with IDEF1X Information Models, Dorset House, 1992.

Chen, P.P.S., "The Entity-Relationship Model -- toward a unified view of data". ACM Transactions on Database

Systems 1, 1, March 1976.

Jackson, Michael A., System Development. Prentice Hall International, Englewood Cliffs, New Jersey, 1983.

Federal Information Processing Standards Publication 184, the Standard for Integration Definition for Information

Modeling, U.S. Department of Commerce, Technology Administration, National Institute of Standards and

Technology, December 1993.

One of the best short summaries of Bachman and Chen data modeling methods which we have found is in

McDonnell Douglas' ProKit WORKBENCH Application Manual, Chapter 8, Data Modeler. This is a proprietary

software documentation manual, so as far as we know it is not a book available for purchase. Contact McDonnell

Douglas (1-800-225-7760) if you are interested. (Note: In April 2002 it was pointed out to us that this document is

no longer easily available from McDonnell Douglas, and thus may be difficult to find.)

Surprisingly, there is virtually no widely accessible published information on the Charles Bachman method of data

modeling. A search on the subject of Bachman data modeling brought up the following articles:

Bachman Information Systems Data Base Management No. 6, The Entity Relationship Approach to Logical

Data Base Design. Q.E.D. Monograph Series (Wellesley: Q.E.D. Information Science, Inc. 1977).

C.W. Bachman "Data Structure Diagrams" Journal of ACM SIGBDP Vol 1 No 2 (March 1969) pages 4-10.

McFadden & Hoffer _Database Management_, 3e, Benjamin Cummings, 1991, ISBN 0-8053-6040-9 or Date

_An Introduction to Database Systems_, Volume 1, 5e, Addison Wesley, 1990, ISBN 0-201-51381-1

Charles W. Bachman: The Role Data Model Approach to Data Structures. 1-18, published in S. M. Deen, P.

Hammersley: Proceedings International Conference on Data Bases, University of Aberdeen, July 1980.

Heyden & Son, 1980, ISBN 0-85501-495-4.


This ends Course Section 4.


A Comparison of Data Modeling Techniques

David C. Hay

[This is a revision of a paper by the same title written in 1995. In addition to stylistic updates, this paper

replaces all the object modeling techniques with the UML – a new technique that is intended to replace at

least all these.]

Peter Chen first introduced entity/relationship modeling in 1976 [Chen 1977]. It was a brilliant idea that

has revolutionized the way we represent data. It was a first version only, however, and many people since

then have tried to improve on it. A veritable plethora of data modeling techniques have been developed.

Things became more complicated in the late 1980’s with the advent of a variation on this theme called

"object modeling". The net effect of all this was that there were now even more ways to model the structure

of data. This was mitigated somewhat in the mid-1990's, with the introduction of the UML, a modeling

technique intended to replace at least all the "object modeling" ones. As will be seen in this article, it is not

quite up to replacing other entity/relationship approaches, but it has had a dramatic effect on the object

modeling world.

This article is intended to present the most important of these and to provide a basis for comparing them

with each other.

Regardless of the symbols used, data or object modeling is intended to do one thing: describe the things

about which an organization wishes to collect data, along with the relationships among them. For this reason,

all of the commonly used systems of notation fundamentally are convertible one to another. The major

differences among them are aesthetic, although some make distinctions that others do not, and some do not

have symbols to represent all situations.

This is true for object modeling notations as well as entity/relationship notations.

There are actually three levels of conventions to be defined in the data modeling arena: The first is

syntactic, about the symbols to be used. These conventions are the primary focus of this article. The second

defines the organization of model diagrams. Positional conventions dictate how entities are laid out. These

will be discussed at the end of the article. And finally, there are conventions about how the meaning of a

model may be conveyed. Semantic conventions describe standard ways for representing common business

situations. These are not discussed here, but you can find more information about them in books by David

Hay [1996] and Martin Fowler [1997].


These three sets of conventions are, in principle, completely independent of each other. Given any of the

syntactic conventions described here, you can follow any of the available positional or semantic conventions.

In practice, however, promoters of each syntactic convention typically also promote at least particular

positional conventions.

In evaluating syntactic conventions, it is important to remember that data modeling has two audiences.

The first is the user community, that uses the models and their descriptions to verify that the analysts in fact

understand their environment and their requirements. The second audience is the set of systems designers,

who use the business rules implied by the models as the basis for their design of computer systems.

Different techniques are better for one audience or the other. Models used by analysts must be clear and

easy to read. This often means that these models may describe less than the full extent of detail available.

First and foremost, they must be accessible to a non-technical viewer. Models for designers, on the other hand, must be as complete and rigorous as possible, expressing as much as possible.

The evaluation, then, will be based both on the technical completeness of each technique and on its

readability.

Technical completeness is in terms of the representation of:

o Entities and attributes

o Relationships

o Unique identifiers

o Sub-types and super-types

o Constraints between relationships

A technique’s readability is characterized by its graphic treatment of relationship lines and entity boxes,

as well as its adherence to the general principles of good graphic design. Among the most important of the

principles of graphic design is that each symbol should have only one meaning, which applies wherever that

symbol is used, and that each concept should be represented by only one symbol. Moreover, a diagram should

not be cluttered with more symbols than are absolutely necessary, and the graphics in a diagram should be

intuitively expressive of the concepts involved. [See Hay 98.]

Each technique has strengths and weaknesses in the way it addresses each audience. As it happens, most

are oriented more toward designers than they are toward the user community. These produce models that are

very intricate and focus on making sure that all possible constraints are described. Alas, this is often at the

expense of readability.

This document presents seven notation schemes. For comparison purposes, the same example model is

presented using each technique. Note that the UML is billed as an "object modeling" technique, rather than as

a data (entity/relationship) modeling technique, but as you will see, its structure is fundamentally the same.

This comparison is in terms of each technique’s symbols for describing entities (or "object classes", for the


UML), attributes, relationships (or object-oriented "associations"), unique identifiers, sub-types and

constraints between relationships. The following notations are presented here.

At the end of the individual discussions is your author’s argument in favor of Mr. Barker’s approach for

use in requirements analysis, along with his argument in favor of UML to support design.

Relationships

Mr. Chen’s notation is unique among the techniques shown here in that a relationship is shown as a two-

dimensional symbol — a rhombus on the line between two or more entities.

Note that this relationship symbol makes it possible to maintain a "many-to-many" relationship without

necessarily converting it into an associative or intersect entity. In effect, the relationship itself is playing the

role of an associative entity. The relationship itself is permitted to have attributes. Note how "quantity",

"actual price", and "line number" are attributes of the relationship Order-line in Figure 1.


Note also that relationships do not have to be binary. As many entities as necessary may be linked to a

relationship rhombus.
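
The idea of a relationship that carries its own attributes can be sketched in code. In the illustration below the order-line relationship links a purchase order to a catalogue item and holds quantity, actual price, and line number itself. The class and field names are assumptions made for the example and are not Mr. Chen's notation.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class PurchaseOrder:
    order_number: int

@dataclass(frozen=True)
class CatalogueItem:
    name: str

@dataclass(frozen=True)
class OrderLine:
    # The relationship itself: it links two entities and has attributes.
    purchase_order: PurchaseOrder
    catalogue_item: CatalogueItem
    line_number: int
    quantity: int
    actual_price: Decimal

order_a, order_b = PurchaseOrder(1), PurchaseOrder(2)
widget, installation = CatalogueItem("widget"), CatalogueItem("installation")

order_lines = [
    OrderLine(order_a, widget, 1, 5, Decimal("9.99")),
    OrderLine(order_a, installation, 2, 1, Decimal("50.00")),
    OrderLine(order_b, widget, 1, 2, Decimal("10.49")),
]

# Many-to-many: one order has several items, and one item is on several orders.
items_on_order_a = [l.catalogue_item for l in order_lines if l.purchase_order == order_a]
orders_for_widget = [l.purchase_order for l in order_lines if l.catalogue_item == widget]
```

A relationship among three or more entities would simply add further entity-typed fields to the relationship class.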

Cardinality/optionality

In Mr. Chen’s original work, only one number appeared at each end, showing the maximum

cardinality. That is, a relationship might be "one to many", with a "1" at one end and a "n" at the

other. This would not indicate whether or not an occurrence of an entity had to have at least one

occurrence of the other entity.

In most cases, an occurrence of an entity that is related to one occurrence of another must be related

to one, and an occurrence of an entity that is related to more than one may be related to none, so most

of the time the lower bounds can be assumed. The event/event category model, however, is unusual.

Having just a "1" next to event showing that an event is related to one event category would not show

that it might be related to none. The "n" which shows that each event category is related to more than

one event would not show that it must be related to at least one.

For this reason, the technique can be extended to use two numbers at each end to show the minimum

and maximum cardinalities. For example, the relationship party-order between purchase order and party shows "1,1" at the purchase order end, meaning that each purchase order must be with no fewer than one party and no more than one party. At the other end, "0,n" shows that a party may or may not

be involved with any purchase orders, and could be involved with several. The event/event category

model would have "0,1" at the event end, and "1,n" at the event category end.
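
The minimum/maximum pairs can be read as simple range checks on how many related occurrences an entity has. The sketch below is one possible encoding, with None standing in for "n"; the class name and constants are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cardinality:
    minimum: int
    maximum: Optional[int]  # None stands for "n" (no upper bound)

    def allows(self, count: int) -> bool:
        """Return True if `count` related occurrences satisfies this pair."""
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum

ONE_ONE = Cardinality(1, 1)      # "1,1": exactly one
ZERO_N = Cardinality(0, None)    # "0,n": none, one, or several
ZERO_ONE = Cardinality(0, 1)     # "0,1": at most one, possibly none
ONE_N = Cardinality(1, None)     # "1,n": at least one
```

For instance, ONE_ONE.allows(0) is False because a purchase order with no party violates "1,1", while ZERO_N.allows(0) is True because a party need not have any purchase orders.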

In an alternative notation, relationship names may be replaced with "E" if the existence of occurrences

of the second entity requires the existence of a related occurrence of the first entity.

Names

Because relationships are clearly considered objects in their own right, their names tend to be nouns.

The relationship between purchase-order and person or organization, for example, is called order-line.

Sometimes a relationship name is simply a concatenation of the two entity names. For example party-

order relates party and purchase order.

Entity and relationship names may be abbreviated.

Unique identifiers

A unique identifier is any combination of attributes and relationships that uniquely identify an occurrence of

an entity.

While Mr. Chen recognizes the importance of attributes as entity unique identifiers [Chen 1977, 23], his

notation makes no provision for showing this. If the unique identifier of an entity includes a relationship to a

second entity, he replaces the relationship name with "E", makes the line into the dependent entity an arrow,

and draws a second box around this dependent entity. (Figure 2 shows how this would look if the relationship


to party were part of the unique identifier of purchase-order). This still does not identify any attributes that are

part of the identifier.

Figure 2: Existence Dependent Relationship

Sub-types

A sub-type is a subset of the occurrences of another entity, its super-type. That is, an occurrence of a sub-type

entity is also an occurrence of that entity’s super-type. An occurrence of the super-type is also an occurrence

of exactly one or another of the sub-types.

Though not in Mr. Chen’s original work, this extension is described by Robert Brown [1993] and Mat Flavin

[1981].

In this extension, sub-types are represented by separate entity boxes, each removed from its super-type and

connected to it by an "isa" relationship. (Each occurrence of a sub-type "is a[n]" occurrence of the super-

type.) The relationship lines are linked by a rhombus and each relationship to a sub-type has a bar drawn

across it. In Figure 1, for example, party is a super-type, with person and organization as its sub-types. Thus

an order-line must be either a product or a service. This isn’t strictly correct, since an order line is the fact

that a product or a service was ordered on a purchase-order. It is not the same thing as the product or service

themselves.

Constraints between relationships

The most common case of constraints between relationships is the "exclusive or", meaning that each

occurrence of the base entity must (or may) be related to occurrences of one other entity, but not more than

one. These will be seen in most of the techniques which follow below.

Mr. Chen does not deal with constraints directly at all. This must be done by defining an artificial entity and

making the constrained entities into sub-types of that entity. This is shown in Figure 1 with the entity

catalogue item, with its mutually exclusive sub-types product and service. Each purchase order has an order-

line relationship with one catalogue item, where each catalogue item must be either a product or a service.
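
The workaround described above, an artificial super-type with mutually exclusive sub-types, maps naturally onto a small class hierarchy. The following is a sketch under that reading: the class names follow the entity names in the text, while the fields and the type check are assumptions made for the example.

```python
from abc import ABC
from dataclasses import dataclass

class CatalogueItem(ABC):
    """Artificial super-type that exists only to carry the constraint."""

@dataclass
class Product(CatalogueItem):
    name: str

@dataclass
class Service(CatalogueItem):
    name: str

@dataclass
class OrderLine:
    line_number: int
    catalogue_item: CatalogueItem

    def __post_init__(self):
        # "Exclusive or": the single related item must be a product or a
        # service. One field holds one object, so it cannot be both.
        if not isinstance(self.catalogue_item, (Product, Service)):
            raise TypeError("catalogue_item must be a Product or a Service")

product_line = OrderLine(1, Product("widget"))
service_line = OrderLine(2, Service("installation"))
```

Because an order line refers to exactly one catalogue item, and that item is an instance of exactly one sub-type, the "one or the other, but not both" rule is enforced without a separate constraint symbol.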

Comments

Mr. Chen was first, so it is not surprising that his technique does not express all the nuances that have been

included in subsequent techniques. It does not annotate characteristics of attributes, and it does not show the

identification of entities without sacrificing the names of the relationships.


While it does permit showing multiple inheritance and multiple type hierarchies, the multi-box approach to

sub-types takes up a lot of room on the drawing, limiting the number of other entities that can be placed on it.

It also requires a great deal of space to give a separate symbol to each attribute and each relationship.

Moreover, it does not clearly convey the fact that an occurrence of a sub-type is an occurrence of a super-

type.


DM STAT-1 Consulting's founder and  President Bruce Ratner, Ph.D. has made the company

the ensample for Statistical Modeling & Analysis and  Data Mining in Direct & Database

Marketing, Customer Relationship Management, Business Intelligence, and Information

Technology.  DM STAT-1 specializes in the full range of standard statistical techniques,        

and  methods  using hybrid statistics-machine learning algorithms,  such as its patented

GenIQ Model©  Data Mining, Modeling & Analysis Software, to achieve its Clients' Goals

- across industries of Banking, Insurance, Finance, Retail, Telecommunications, Healthcare,

Pharmaceutical, Publication & Circulation, Mass & Direct Advertising, Catalog Marketing,

Online Marketing, Web-mining, B2B, Human Capital Management, Risk Management, and

Nonprofit Fundraising. Bruce’s par excellence consulting expertise is clearly apparent as he

wrote  the best-selling book  Statistical Modeling and  Analysis  for Database Marketing:

Effective Techniques for Mining Big Data. (based on Amazon Sales Rank).

Clients' Goals include: 

Results-Oriented : Increase Response Rates; Drive Costs Down and Revenue Up;

Increase Customer Retention; Stem Attrition; Check Churn; Increase Customer

Affinity - Match Products with Customer Needs; Enhance Collections & Recovery

Efforts; Improve Risk Management;  Strengthen Fraud Detection Systems; Increase

Number of Loans without Increasing Risk; Work Up Demographic- based Market

Segmentation for Effective Product Positioning; Perform Retail Customer

Segmentation for New Marketing Strategies; Construct New Business Acquisition

Segmentation to Increase Customer Base; Identify Best Customers: Descriptive,

Predictive and Look-Alike Profiling to Harvest Customer Database; Increase Value of

Customer Retention; Generate Business-to-Business Leads for Increased Profitability; Target Sales Efforts to Improve Loyalty Among the Most

Profitable Customers; Improve Customer Service by Giving Marketing and Sales

Better Information; Build CRM Models for Identifying High-value Responders; Build

CRM Models to Run Effective Marketing Campaigns; Improve Human Resource


Management -   Retain the Best Employees; Optimize Price and Package Offerings;

Right Offer at the Right Time with the Right Channel; Maintain Product Profitability

and Support Effective Product Management; Increase the Yield of Nonprofit

Fundraising Campaigns; Optimize Customer Loyalty; CRM for Cross-Sell and Up-Sell

to Improve Response Rates and Increase Revenue; CRM Segmentation for Targeted

Marketing; Workforce Optimization; Personalize Recommendations for Information,

Products or Services; Credit Scoring to Control Risk; Retain Best Customers and

Maximize Their Profits; Nonprofit Modeling: Remaining Competitive and Successful;

Subprime Lender Short Term Loan Models for Credit Default and Exposure; Retail

Revenue Optimization: Accounting for Profit-eating Markdowns; Detecting Fraudulent Insurance Claims;

Demand Forecasting for Retail; Cross-Sell and Up-Sell to Improve Response Rates

and Increase Revenue; Credit Scoring for Controlling Risk; and so on.

Analytical Strategy :  Build, Score and Validate Logistic Regression Models, Ordinary

Regression Models, Variant Regression-based Models, Decision-Tree Models,

Machine-Learning Models, Quasi-Experimental Design Models, Marketing

Mix Optimization Models; Latent Class Models, Survival/Proportional Hazards

Models, and Structural Equation Models, Machine-Learning Conjoint Analysis, and all

other models in the data analyst's tool kit for problem-solution approaches.

o Model Types : Acquisition/Prospect Models, Retention Models, Attrition

Models, LifetimeValue Models, Credit Risk Models, Response-Approval

Models, Contact-Conversion Models,  Contact-Profit Models, Customer-Value

Based Segmentation Models; Credit Scoring Models, Web-traffic Models,

Balanced Scorecard Models, Cross-sell/Up-sell Models, Zipcode-based

Models, Blockgroup-based Models Decision-Tree Inventory  Forecast Models,

Models for Maximizing Profits from Solicitations, Mortgage and Credit Card

Default Models, Trigger Marketing Model, Fraud Detection: Beyond the Rules-

Based Approach, Workforce Optimization Model, Collaborative Filtering

Systems, and an assortment of results-related analytical strategies.

Analytical Tactics:  Procedure When Statistical Model Performance is Poor; Procedures

for Data that are Too Large to be Handled in the Memory of Your Computer;

Detecting Whether the Training and Hold-out Subsamples Represent the Same Universe to Ensure that the Validation of a Model is Unbiased; Data Preparation

for Determining Sample Size; Data Preparation for Big Data;  The Revised 80/20

Rule for Data Preparation; Implement Data Cleaning Methods; Guide Proper Use of

the Correlation Coefficient; Understand Importance of the Regression Coefficient;

Effect Handling of Missing Data, and Data Transformations; High Performance

Computing for Discovering Interesting and Previously Unknown Information in -

credit bureau, demographic, census, public record, and behavioral

databases;  Deliverance of Incomplete and Discarded Cases; Make Use   of Otherwise

Discarded Data; Determine  Important Predictors; Determine How Large a Sample is


Required; Automatic Coding of Dummy Variables; Invoke Sample Balancing;

Establish Visualization Displays; Uncover and Include Linear Trends and 

Seasonality Components in Predictive Models; Modeling a Distribution with a Mass at

Zero; Upgrading Heritable Information; "Smart" Decile Analysis for Identifying

Extreme Response Segments; A Method for Moderating Outliers, Instead of

Discarding Them; Extracting Nonlinear Dependencies: An Easy, Automatic Method;

The GenIQ Model: A Method that Lets the Data Specify the Model; Data Mining

Using Genetic Programming; Quantile Regression: Model-free Approach; Missing

Value Analysis: A Machine-learning Approach; Gain of a Predictive Information

Advantage: Data Mining via Evolution; and many more analytical strategy-related

analytical tactics.
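Several of the tactics listed above rely on a decile analysis, so a small illustration may help. The sketch below is a plain, textbook-style decile table and should not be read as the "Smart" Decile Analysis named in the list, whose details are not given here. It assumes each individual has a model score and a 0/1 response flag; the function name `decile_analysis` and the row layout are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical row layout: (decile, count, responders, response_rate, cum_lift)
DecileRow = Tuple[int, int, int, float, float]


def decile_analysis(scores: List[float], responses: List[int]) -> List[DecileRow]:
    """Build a basic decile table from model scores and 0/1 responses.

    Individuals are ranked from highest to lowest score and cut into ten
    roughly equal groups. Cumulative lift is the cumulative response rate
    divided by the overall response rate, times 100 (100 = no better than
    random selection).
    """
    if not scores or len(scores) != len(responses):
        raise ValueError("scores and responses must be non-empty and equal length")
    total_responders = sum(responses)
    if total_responders == 0:
        raise ValueError("lift is undefined when there are no responders")

    # Rank individuals by score, best prospects first.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = total_responders / n

    rows: List[DecileRow] = []
    cum_count = 0
    cum_responders = 0
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        if not group:  # fewer than ten individuals: some deciles are empty
            continue
        responders = sum(flag for _, flag in group)
        cum_count += len(group)
        cum_responders += responders
        cum_lift = 100 * (cum_responders / cum_count) / overall_rate
        rows.append((d + 1, len(group), responders, responders / len(group), cum_lift))
    return rows


if __name__ == "__main__":
    # Toy data: 100 individuals, the 20 highest-scored ones respond.
    toy_scores = [i / 100 for i in range(100, 0, -1)]
    toy_responses = [1] * 20 + [0] * 80
    for row in decile_analysis(toy_scores, toy_responses):
        print(row)
```

In a table like this, a top-decile cumulative lift well above 100 suggests the model concentrates responders among its highest-scored individuals; the bottom row always returns to 100 because it covers the whole file.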

The Banking Industry Problem-Solution: Reduce Costs, Increase Profits by Data Mining and Modeling

Bruce Ratner, Ph.D.

In today’s slow-moving economy, the banking industry is in a tough, competitive “boxing ring,” in which it is getting hit with high customer attrition rates. Achieving its goals of reducing costs and increasing profits is a matter of “survival of the fittest.” Fortunately, the gargantuan volumes of transaction data that banks gather daily are the key ingredient for achieving those goals. High-performance computing for discovering interesting and previously unknown information within that data is needed as part of a tactical analytical strategy to build models that meet those goals. Traditional statistical approaches are virtually ineffectual at data mining, i.e., at uncovering undetected cost-reduction and profit-gaining predictive relationships. This knowledge is vital for building models that reduce costs and increase profits. The purpose of this article is to demonstrate the data-mining muscle of the genetic data-mining feature of the GenIQ Model©. I discuss case studies that use the “body blows” of genetic data mining to produce victorious cost-reduction and profit-gaining models. A preview of the 9-step GenIQ modeling process and FAQs about GenIQ are also available.

Demand Forecasting for Retail: A Genetic Approach

Bruce Ratner, Ph.D.

Accurate demand forecasting is essential for retailers to minimize the risk of stores running out of a product, or not having enough of a popular brand, color, or style. Preseason and in-season forecast errors account for 20 to 25 percent of losses in sales. Traditional demand forecasting methods for all stock-keeping units (SKUs) across all stores and all geographies share an inherent weakness: they cannot data mine the volumes of time-series data at the SKU level. The purpose of this article is to present a machine-learning approach, the GenIQ Model©, for demand forecasting that has demonstrated superior results compared with traditional techniques.

For more information about this article, call Bruce Ratner at 516.791.3544,

1 800 DM STAT-1, or e-mail at [email protected].

DM STAT-1 website visitors will receive my latest book, Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, at a 33% discount, plus shipping costs, just for the asking.