
DCS210 Course Manual: A Gentle Introduction to Database Normalization 2014 © IACC, ABU Zaria

Downloaded from http://www.auwalgene.com/mystudents/lecturenotes

FREE, NOT FOR SALE!

Abridged Lecture Notes For

DCS210 Introduction to Database Management (II)

prepared and delivered by

Adamu Auwal Gene MCPN @IACC, Ahmadu Bello University, Zaria – Nigeria

Last Updated: March, 2014




A Gentle Introduction to

Database Normalization

Abridged Lecture Notes For

DCS210 Introduction to Database Management (II)

Diploma in Computer Science

Year II, Semester II

Prepared and Delivered

By

Adamu Auwal Gene MCPN Chartered Information Technology Professional

@Iya Abubakar Computer Centre, Ahmadu Bello University, Zaria – Nigeria

Last Updated: March, 2014




CONTENTS

INTRODUCTION

JUST SOME FEW RULES PLEASE, BEFORE WE START

SECTION I: BASIC CONCEPTS OF NORMALIZATION
1.1: So, what is normalization?
1.2: Anomalies
1.2.1: Insert anomaly
1.2.2: Delete anomaly
1.2.3: Update anomaly
1.3: Purpose or goals of normalization
1.4: Advantages and disadvantages of normalization
1.5: Functional Dependencies (FD)
1.6: Partial Dependencies
1.7: Transitive Dependencies
1.8: Normal Forms (NF)
1.8.1: First Normal Form (1NF)
1.8.2: Second Normal Form (2NF)
1.8.3: Third Normal Form (3NF)
1.8.4: Other Normal Forms
1.9: Class Exercises: Can You?
1.10: Extra Credit: Can You?

SECTION II: NORMALIZATION BY EXAMPLE
2.1: Case-Study Introduction
2.2: Conversion to First Normal Form
2.2.1: Step 1: Eliminate the repeating groups
2.2.2: Step 2: Identify the primary key
2.2.3: Step 3: Identify all dependencies
2.3: Conversion to Second Normal Form
2.3.1: Step 1: Identify all the key components
2.3.2: Step 2: Identify the dependent attributes
2.4: Conversion to Third Normal Form
2.4.1: Step 1: Identify each new determinant
2.4.2: Step 2: Identify the dependent attributes
2.4.3: Step 3: Remove the dependent attributes from transitive dependencies
2.5: Improving the Design

SECTION III: NORMALIZATION EXTRA CREDITS (CAN YOU?)
3.1: Extra Credit 1: Worked Example

SECTION IV: USEFUL DATABASE TERMS AND DEFINITIONS




INTRODUCTION

WELCOME TO the second part of your course on Introduction to Database Management. Of course you have been learning about and working with databases since last semester here at IACC; and the term "normalization" is not new to you: we have referred to it more than a dozen times in last semester's course, DCS209. You also might have heard phrases or sentences like "the database is not [correctly] normalized" or "My database is now in BCNF" and so on. All these terms may sound somewhat academic or intimidating but trust me, you need not be scared of normalization anymore once we finish this course manual (and especially when you attend my lectures and labs punctually, consistently and attentively).

In this course manual, you will be introduced to the basic concept of database normalization, taking a brief look at the most common normal forms. Your future explorations of the principles of database design and implementation will provide more in-depth principles and practices of the normalization process.

Please be reminded that this manual covers only one topic in the whole of your DCS210 syllabus, so be sure to go online and download all the other manuals for DCS210 and more useful resources at my website, which can be accessed anytime at

http://www.auwalgene.com/mystudents/lecturenotes

IMPORTANT NOTE: It is YOUR PERSONAL RESPONSIBILITY to download, print and bind all course materials as advised. Your C.A. and exam questions will always be set from what has been covered in these manuals and in the class. As a result, your final scores in this course will depend largely on how promptly you get and study all the materials, as well as how regularly and attentively you participate in class and lab sessions.

Finally, I take no responsibility for any spelling or grammatical errors found in this manual. My written English is probably none too good, so I won't take offence at any corrections from any Grammar Nazi out there: I am fluent in programming languages, not the English language!

Best regards, and happy database normalization!

M-Auwal Gene mcpn @IACC, ABU Zaria March, 2014





JUST SOME FEW RULES PLEASE, BEFORE WE START

1: Attendance Policy: Please note

that all students are expected to

attend every class and lab session on

time. Punctuality is expected, and is

part of your cumulative continuous

assessment. In case of unexpected

events that make it impossible for

any student to attend class or lab

sessions, such students should

contact me (or any other Instructor

in charge) via phone call or send an

SMS text message briefly explaining

why they would not be in the class or

lab.

2: Extra Credit: Occasionally there

are opportunities for students to earn

extra credits for exceptionally

excellent work or enthusiastic

attitude towards study in this

course. There is no guarantee that

there will be extra credit

opportunities every time; but

whenever the opportunity arises, all

students will have an equal chance of

earning those extra credits.

Maximum extra credit obtainable by

any student is 5 points (out of 100).

3: Assignments: To evaluate

students’ learning progress, one or

more take-home assignments shall be

given to students at the end of every

class or lab session. Those

assignments will mostly be based on

current topics being discussed; but

may also sometimes include work

outside of the current topic.

4: Make-Up and Late Policy: All

assignments that are handed in late

will be docked 2% per day that they

are late, unless arrangements have

been made at least 48 hours before

the due date. The term “LATE” refers

to all assignments that are turned in

after the class or lab time on the

assignment’s due date. Please note

that I am not responsible for you not

having your personal laptop, or not

having Internet access, or not having

access to the lab computers to enable

you to do your assignments. You will

normally be given freedom to do all

practical assignments in the lab if you

properly approach the Centre’s

Operations Manager or any of the Lab

Support Staff on duty.




5: Grading Policy: The following

grading policy shall apply during this

course (both theory and practical

labs are covered):

Please note that every student’s

grade totally depends on what he or

she has achieved during the course:

the grades will be earned, not given!

6: Lab Etiquette: Since we are a

large class in a large lab, let us all

faithfully follow these four simple

rules in order to make life easy for

everyone:

i. Be punctual. Coming in late

disrupts your fellow students.

If you are going to be late for a

lab session, perhaps you should

not bother coming to the lab, as

you might not be able to catch

up anyway.

ii. Do not leave the lab early

unless it is an emergency.

iii. No texting, phone calls or

Internet browsing during class

or lab sessions.

iv. Kindly turn off cell phones, and

Internet access. If your phone

rings during a class or lab

session or you are seen

browsing the ’Net during a

class or lab session, you shall

be penalized – and your

penalty is to provide snacks

and drinks for the Instructor at

the next class or lab session.



FREE, NOT FOR SALE!

SECTION I: BASIC CONCEPTS OF NORMALIZATION

1.1: So, what is normalization?

It will be great if we begin from the beginning, right? So we are going to start by

trying to understand clearly what normalization actually is. In database

management systems, normalization is a logical database design method. In

simple English, it is a process of systematically breaking a complex database table

into simpler ones so as to efficiently organize data in that database.

Put another way, we can say normalization is a process in which database tables are systematically examined for anomalies and, when anomalies are detected, they are removed by splitting the table into two or more new, related tables.

If you like, you may also say normalization is a process for evaluating and

correcting table structures to minimize data redundancies, thereby helping to

eliminate data anomalies. It helps us evaluate table structures and produce

good tables.

In essence, normalization is the process of eliminating “bad” dependencies by

splitting up tables and linking them with foreign keys.

It is a formal process of decomposing relations with anomalies to produce

smaller, well-structured and stable relations (tables). Primarily, it is a tool to

validate and improve a logical design so that it satisfies certain constraints that

avoid unnecessary duplication of data.

Normalization is a very important part of the database development process; as it

is often during normalization that database designers get their first real look into

how the data are going to interact in the database.

Finding problems with the database structure at this stage is strongly preferred

to finding problems later on in the development process after so much work has

been done (wrongly).

In a short while, we shall understand more about normalization and try it out

ourselves; but for now, let us find out why we need normalization in database

design and what are the pros and cons of normalization.




Fig. 1: Normalization Concept Map

1.2: Anomalies

In relational database design, we not only want to create a structure that stores all of the data, but we also want to do it in a way that minimizes potential errors when we work with the data. The default language for accessing data from a relational database is SQL. You will recall that SQL can be used to manipulate data in the following ways: insert new data, delete unwanted data, and update existing data. Similarly, in an un-normalized design, there are three problems that can occur when we work with the data:




1.2.1: Insert anomaly: This refers to the situation where it is impossible to insert certain types of data into the database or, if the data must be inserted anyway, the insertion of new rows forces the user to create duplicate data. In short, an insert anomaly occurs when extra data beyond the desired data must be added to a table.

1.2.2: Delete anomaly: The deletion of data leads to the unintended loss of additional data that we had wished to preserve.

1.2.3: Update anomaly: This refers to the situation where updating the value of a column leads to database inconsistencies (i.e., because data is duplicated, changing data in one row forces the same change in other rows; otherwise the unchanged rows will hold values that are inconsistent with the updated one). It occurs when it is necessary to change multiple rows to modify ONLY a single fact.

To address the three problems above, we go through the process of

normalization. When we go through the normalization process, we increase the

number of tables in the database, while decreasing the amount of data stored in

each table. There are several different levels of database normalization as you

will learn later.
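To see the three anomalies in action, here is a minimal sketch using Python's built-in sqlite3 module. Please note that the Enrolment table, its columns and all the sample rows are made up purely for illustration; they are not taken from any table in this manual.

```python
import sqlite3

# Hypothetical un-normalized table: every row mixes student facts with
# course facts, so the same course title is stored again and again.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Enrolment ("
    " reg_num TEXT, student_name TEXT, course_code TEXT, course_title TEXT)"
)
conn.executemany(
    "INSERT INTO Enrolment VALUES (?, ?, ?, ?)",
    [
        ("D14-001", "Student A", "DCS209", "Intro to Databases I"),
        ("D14-001", "Student A", "DCS210", "Intro to Databases II"),
        ("D14-002", "Student B", "DCS210", "Intro to Databases II"),
    ],
)

# Update anomaly: the DCS210 title lives in two rows. Changing only one of
# them leaves the table with two different titles for the same course.
conn.execute(
    "UPDATE Enrolment SET course_title = 'Database Management II' "
    "WHERE course_code = 'DCS210' AND reg_num = 'D14-001'"
)
titles = conn.execute(
    "SELECT COUNT(DISTINCT course_title) FROM Enrolment "
    "WHERE course_code = 'DCS210'"
).fetchone()[0]
print("Distinct titles for DCS210:", titles)

# Delete anomaly: deleting the only DCS209 enrolment also deletes the only
# record that the course DCS209 exists at all.
conn.execute("DELETE FROM Enrolment WHERE course_code = 'DCS209'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM Enrolment WHERE course_code = 'DCS209'"
).fetchone()[0]
print("Rows still describing DCS209:", remaining)

# Insert anomaly: a new course with no students yet can only be recorded
# by leaving the student columns empty (NULL).
conn.execute(
    "INSERT INTO Enrolment VALUES (NULL, NULL, 'DCS211', 'A new course')"
)
placeholder_rows = conn.execute(
    "SELECT COUNT(*) FROM Enrolment WHERE reg_num IS NULL"
).fetchone()[0]
print("Rows with no student:", placeholder_rows)
```

In a normalized design the course title would be stored once, in a table of its own, so none of these three situations would arise.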

1.3: Purpose or goals of normalization

There are two important goals or

reasons for normalization:

i. to eliminate redundant data

(i.e. ensuring that the same

data is not stored in more than

one table). This improves

consistency.

ii. to ensure that data

dependencies make sense (only

storing related data in a table).

Some additional goals of

normalization include:

iii. to avoid or minimize anomalies

(i.e. insertion, deletion and

update anomalies).




iv. to provide maximum flexibility

to meet future information

needs by keeping tables

corresponding to object types

in their simplified forms.

v. to produce a clearer and more readable data model.

1.4: Advantages and disadvantages of normalization

Advantages:

i. Reduce data redundancy & space required

ii. Enhance data consistency

iii. Enforce data integrity

iv. Reduce update cost

v. Provide maximum flexibility in responding to ad hoc queries

vi. Allow the use of parallelism

vii. Can reduce the total number of rows per block.

Disadvantages:

i. Many complex queries will be slower because joins have to be performed to

retrieve relevant data from several normalized tables

ii. Programmers/users have to understand the underlying data model of a database application in order to perform proper joins among several tables

iii. The formulation of multiple-level queries is a very daunting, non-trivial task.

Now, before we go any further, let us

recall that the relational model we

have been studying since last

semester (DCS209) consists of the

elements: relations (or tables), which

are made up of attributes (or

columns). We have learnt that:

A relation or table is a set of

attributes or columns with values

for each attribute such that:




1. Each attribute (column) value must

be atomic (i.e. a single value only).

2. All values for a given attribute (column) must be of the same data type.

3. Each attribute (column) name must

be unique.

4. The sequence of attributes

(columns) is insignificant.

5. No two tuples (rows) in a table

should be identical.

6. The sequence of the tuples (rows) is

insignificant.

You will recall also, that from our

discussion of E-R Modeling, we

know that an entity typically

corresponds to a relation (or

table, if you like) and that the

entity’s attributes become

columns of the table.

We also discussed how, depending on the relationships between entities, copies of attributes (the identifiers) could be placed in related tables, where they become foreign keys.

From here, if we remember all our

fundamental discussions very well as

summarized above, then we can

begin to delve into normalization by

looking first at how to identify

functional dependencies within

relations or tables. But if we still have

issues with our fundamental

concepts of relational databases,

please refer back to your DCS209

notes and have a thorough revision,

then come and join us when you have

gotten those fundamental concepts

clear.

1.5: Functional Dependencies (FD)

A functional dependency or FD for

short describes a relationship

between attributes within a single

relation. That is, functional

dependency is about the relationship

between the columns of a table. So if

you have a table with two or more

columns, we say one column is

functionally dependent on another if

we can use the value of one column

to determine the value of another. A

simple example will make this clear:

Fig. 2




Say, we have a StudentsTable relation as shown in Fig. 2 for example. We can

know the name of any student if we know his/her registration number by writing

a SELECT query like this for example:

SELECT student_name, gender, points_earned

FROM StudentsTable WHERE reg_num = 'D14-005';

And that should return AUWAL

MUKHTAR GENE in the query result.

With this example, we say the

student_name attribute (or column)

of the StudentsTable relation is

functionally dependent on the

reg_num attribute (or column)

because reg_num can be used to

uniquely determine the value of

student_name.

That was easy for you to grasp, I

hope. Now, there are standard

conventions used to

communicate functional

dependency notations.

Generally, the arrow symbol → is

used to indicate a functional

dependency. So, we may have

something like: X → Y, which is read

or interpreted as "X functionally

determines Y" or, reading in the

reverse, we say "Y is functionally

dependent on X". So, for our

preceding reg_num and

student_name example above, we

may write reg_num →

student_name.

NOTE:

The attributes listed on the left hand side of the → symbol are called determinants.

One can also read A → B as, “A determines B”. Or more specifically: "Given

a value for A, we can uniquely determine one value for B".

A key (maybe a primary key, for example) functionally determines a tuple (row). So one functional dependency that can always be written is:

The Key → All other attributes

Not all determinants are keys, however!

A functional dependency requires that the value of a certain set of attributes uniquely determines the value of another set of attributes.




A functional dependency is a generalization of the notion of a key.

There is a great deal of mathematical theory behind normalization, but we shall not concern ourselves with all of that in this introductory course.
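If you would like to test a functional dependency against some sample rows, the idea "given a value for X, we can uniquely determine one value for Y" can be sketched in a few lines of Python. The helper name fd_holds and every sample value below are assumptions made for illustration, apart from the D14-005 registration number and name quoted earlier. Bear in mind that a functional dependency is a rule about all valid data, so a check like this can only show that a dependency is violated by the rows at hand; it cannot prove the rule in general.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not violated by rows.

    rows is a list of dicts; determinant and dependent are lists of
    column names. (Hypothetical helper, written for this example only.)
    """
    seen = {}
    for row in rows:
        left = tuple(row[col] for col in determinant)
        right = tuple(row[col] for col in dependent)
        # The first time we meet a determinant value we remember what it
        # maps to; any later row with a different value breaks the FD.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Made-up sample rows (only the D14-005 row echoes the text above).
students = [
    {"reg_num": "D14-001", "student_name": "STUDENT ONE", "points_earned": 70},
    {"reg_num": "D14-002", "student_name": "STUDENT TWO", "points_earned": 85},
    {"reg_num": "D14-005", "student_name": "AUWAL MUKHTAR GENE", "points_earned": 85},
]

reg_determines_name = fd_holds(students, ["reg_num"], ["student_name"])
points_determine_name = fd_holds(students, ["points_earned"], ["student_name"])
print(reg_determines_name)    # reg_num -> student_name
print(points_determine_name)  # points_earned -> student_name
```

Here reg_num → student_name is respected by every row, while points_earned → student_name is not, because two different students earned the same number of points.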

1.6: Partial Dependencies

In a database, a partial dependency occurs when an attribute is dependent only

partially on the primary key, as opposed to the primary key in its entirety. In this

case, the primary key is a composite key. Having a partial dependency will violate

the second normal form. To remove these dependencies, separate tables will

need to be created so normalization is possible.

Example: Your instructor will come up with one or more examples in class to

illustrate partial dependencies. Please be present and attentive!

1.7: Transitive Dependencies

Transitive dependencies occur when there is an indirect relationship that causes a functional dependency. Third Normal Form usually deals with transitive dependencies. This means that if we have a primary key A and two non-key attributes B and C, where B is directly dependent on A and C is directly dependent on B rather than on A, then C can be considered transitively dependent on A.

For example, ”A → C” is a transitive dependency when it is true only because both

“A → B” and “B → C” are true.

Another way to look at it is a bit like a

stepping stone across a river. If we

consider the primary key A to be the

far bank of the river and our non-key

domain C to be our current location,

in order to get to A, our primary key,

we need to step on a stepping stone

B, another non-key domain, to help

us get there. Of course we could jump

directly from C to A, but it is easier,

and we are less likely to fall in, if we

use our stepping stone B. Therefore

current location C is transitively

dependent on A through our stepping

stone B (see Fig. 2).




GOOD TO KNOW

From a structural point of view,

2NF is better than 1NF, and 3NF is

better than 2NF. For most business

database design purposes, 3NF is

as high as we need to go in the

normalization process. And some

very specialized applications may

require normalization beyond 4NF.

Note: A transitive dependency can

occur only in a relation that has three

or more attributes.

Fig. 2: Transitive dependency illustrated.

1.8: Normal Forms (NF)

Normalization works through a series of stages called normal forms.

There are quite a number of normal forms as

listed below:

First Normal form (1NF)

Second normal form (2NF)

Third normal form (3NF)

Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)

Domain-key normal form (DKNF)

Although normalization is a very important database design ingredient, you

should not assume that the highest level of normalization is always the most

desirable. Generally, the higher the normal form, the more SQL joins are required

to produce a specified output and the more slowly the database system responds

to end-user demands. A successful design must, therefore, always consider end-

user demand for fast performance. So, you will occasionally be expected to

"denormalize" some portions of a database design in order to meet performance

requirements.




1.8.1: First Normal Form (1NF)

A database relation (table) is in first normal form (1NF) if and only if it satisfies

the following two key conditions:

i. Contains only atomic values

ii. There are no repeating groups or duplicate rows

(and according to some authors, an additional requirement is that entries in any

given column should be of the same kind.)

Explanation 1:

In relational database parlance, an "atomic value" is a value that cannot be

divided. For example, in a table that has [RowID], [StudentNames], [Gender] and

[ContactAddress] as its fields, the [StudentNames] field may contain values like

"Muhammad-Auwal Gene", "Aremu Oluwakemi Juliet" and so on; while the

[ContactAddress] field may contain values like "No. 18, Usman Akilu Street,

Kaduna" or "23/25, Mora Road, Tudun Wada, Zaria, Kaduna State".

Fig. 3

In the examples above, the contents of both the [StudentNames] and [ContactAddress] fields are not atomic, because the [StudentNames] field can be broken into surname and other_names (or first_name, middle_name and surname); while the [ContactAddress] field can be broken into house_number, street_name, city_name and state. So, such a table is not in 1NF because the values are not atomic.




Explanation 2:

In relational database speak, a "repeating group" means that a table contains

two or more columns that are closely related. For example, a table that records

data on a book and its author(s) with the following columns: [Book ID], [Author

1], [Author 2], [Author 3] is not in 1NF because [Author 1], [Author 2], and

[Author 3] are all repeating the same attribute.

Important Note: The requirement that there be no duplicated rows in the table

means that the table should have a key (although the key might be made up of

more than one column – even, possibly, of all the columns).

Question: So, how do we correct a table that is not in 1NF to become 1NF? Let's

discuss in class.

1NF Example: Consider the following example:

This table is not in first normal form because the [PhoneNumbers] column is

allowed to contain multiple values. For example, the second row includes values

"08032126160" and "07067430539".

To bring this table to first normal form, we split the table into two tables and now

we have the resulting tables as follows:




Now first normal form is satisfied, as the columns on each table all hold just one

value. Note that we had to split the [ContactName] field in MyContactBasics table

too!
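The split described above can be sketched with Python's built-in sqlite3 module as follows. Only the MyContactBasics table name and the two phone numbers come from this manual; the MyContactPhones table name, all column names and the contact names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per contact, with the name already split into two columns.
    CREATE TABLE MyContactBasics (
        contact_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        surname    TEXT NOT NULL
    );
    -- One row per phone number (hypothetical table name).
    CREATE TABLE MyContactPhones (
        contact_id   INTEGER NOT NULL REFERENCES MyContactBasics(contact_id),
        phone_number TEXT NOT NULL,
        PRIMARY KEY (contact_id, phone_number)
    );
""")

# Un-normalized input: the last field may hold several phone numbers.
raw_contacts = [
    (1, "Contact", "One", "08000000001"),
    (2, "Contact", "Two", "08032126160, 07067430539"),
]

for contact_id, first_name, surname, phones in raw_contacts:
    conn.execute(
        "INSERT INTO MyContactBasics VALUES (?, ?, ?)",
        (contact_id, first_name, surname),
    )
    # Each phone number becomes its own row, so every value is atomic.
    for phone in phones.split(","):
        conn.execute(
            "INSERT INTO MyContactPhones VALUES (?, ?)",
            (contact_id, phone.strip()),
        )

phones_of_two = [
    row[0]
    for row in conn.execute(
        "SELECT phone_number FROM MyContactPhones "
        "WHERE contact_id = 2 ORDER BY phone_number"
    )
]
print(phones_of_two)
```

The second contact now owns two rows in the phone table instead of one crowded field, and a third number can be added later without changing the table structure.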

1.8.2: Second Normal Form (2NF)

A database relation (table) is in 2NF if it meets the criteria for 1NF and if all non-key attributes are fully functionally dependent on the primary key.

Note: Since a partial dependency occurs when a non-key attribute is dependent

on only a part of the (composite) key, the definition of 2NF is sometimes phrased

as, "A table is in 2NF if it is in 1NF and if it has no partial dependencies."

Note also that any table with a primary key that is composed of a single

attribute (column) is automatically in second normal form.

Explanation:

In a table, if attribute B is functionally dependent on A, but is not functionally dependent on any proper subset of A, then B is considered fully functionally dependent on A. Hence, in a 2NF table, no non-key attribute can be dependent on only a subset of the primary key. Note that if the primary key is not a composite key, all non-key attributes are always fully functionally dependent on the primary key. A table that is in 1st normal form and has a single-attribute primary key is automatically in 2nd normal form.





This table has a composite primary key [Customer ID, Store ID]. The non-key

attribute is [Purchase Location]. In this case, [Purchase Location] only depends

on [Store ID], which is only part of the primary key. Therefore, this table does

not satisfy second normal form.

To bring this table to second normal form, we break the table into two tables, and

now we have the following:

What we have done is to remove the partial functional dependency that we

initially had. Now, in the table TABLE_STORE, the column [Purchase Location] is

fully dependent on the primary key of that table, which is [Store ID].
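As a rough sketch of this decomposition, the two resulting tables could be created and queried with Python's built-in sqlite3 module as shown below. TABLE_STORE is named in the text; the name TABLE_PURCHASE, the snake_case column names and all sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The location depends on the store alone, so it lives with the store.
    CREATE TABLE TABLE_STORE (
        store_id          INTEGER PRIMARY KEY,
        purchase_location TEXT NOT NULL
    );
    -- Hypothetical name for the table that keeps the composite key.
    CREATE TABLE TABLE_PURCHASE (
        customer_id INTEGER NOT NULL,
        store_id    INTEGER NOT NULL REFERENCES TABLE_STORE(store_id),
        PRIMARY KEY (customer_id, store_id)
    );
""")
conn.executemany(
    "INSERT INTO TABLE_STORE VALUES (?, ?)",
    [(1, "Zaria"), (2, "Kaduna")],
)
conn.executemany(
    "INSERT INTO TABLE_PURCHASE VALUES (?, ?)",
    [(101, 1), (101, 2), (102, 1)],
)

# A join rebuilds the original three-column view whenever it is needed.
rows = conn.execute("""
    SELECT p.customer_id, p.store_id, s.purchase_location
    FROM TABLE_PURCHASE AS p
    JOIN TABLE_STORE AS s ON s.store_id = p.store_id
    ORDER BY p.customer_id, p.store_id
""").fetchall()
print(rows)

# Changing a store's location now touches exactly one row.
cursor = conn.execute(
    "UPDATE TABLE_STORE SET purchase_location = 'Samaru' WHERE store_id = 1"
)
updated_rows = cursor.rowcount
print("Rows updated:", updated_rows)
```

Because each location is stored once, the update anomaly from the original table disappears: both customers who bought at store 1 see the new location through the join.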

1.8.3: Third Normal Form (3NF)

A database relation (table) is in 3NF if it meets the criteria for 2NF and if it has no transitive dependencies (i.e. if no non-key attribute depends on another non-key attribute).

Remember: By "transitive functional dependency", we mean we have the following relationships in the table: B is functionally dependent on A, and C is functionally dependent on B. In this case, C is transitively dependent on A via B.





In the table above, [Book ID] determines [Genre ID], and [Genre ID] determines [Genre Type]. Therefore, [Book ID] determines [Genre Type] via [Genre ID] and we have a transitive functional dependency, so this structure does not satisfy third normal form.

To bring this table to third normal form, we split the table into two as follows:

Now all non-key attributes are fully functionally dependent only on the primary key. In TABLE_BOOK, both [Genre ID] and [Price] are only dependent on [Book ID]. In TABLE_GENRE, [Genre Type] is only dependent on [Genre ID].
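To convince yourself that this split loses no information, here is a small sketch in plain Python. The table names follow the text (TABLE_BOOK and TABLE_GENRE), but the sample rows and variable names are made up for illustration.

```python
# Hypothetical flat rows: (book_id, genre_id, genre_type, price)
flat_books = [
    (1, 10, "Fiction", 1500),
    (2, 20, "Science", 2500),
    (3, 10, "Fiction", 1200),
]

# TABLE_BOOK keeps only what depends directly on the book.
table_book = sorted(
    {(book_id, genre_id, price)
     for book_id, genre_id, _genre_type, price in flat_books}
)

# TABLE_GENRE keeps what depends on the genre; duplicates collapse
# because we build a set first.
table_genre = sorted(
    {(genre_id, genre_type)
     for _book_id, genre_id, genre_type, _price in flat_books}
)

# Joining the two tables on genre_id gives back the original rows.
genre_lookup = dict(table_genre)
rebuilt = sorted(
    (book_id, genre_id, genre_lookup[genre_id], price)
    for book_id, genre_id, price in table_book
)

print(table_genre)
print(rebuilt == sorted(flat_books))
```

Notice that "Fiction" appeared twice in the flat rows but only once in the genre table, which is exactly the redundancy that third normal form removes.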

1.8.4: Other Normal Forms

Apart from the 1-3 NF you have learnt about so far, there are a number of other

normal forms which you are advised to investigate and know about. In

particular, the following four additional normal forms are recommended for your

further study:




Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)

Domain-key normal form (DKNF)

1.9: Class Exercises: Can You?

The table below contains a number of functional dependency expressions

without interpretations, as well as interpretations without the FD expressions.

Write out the full expression or interpretation in the blank spaces provided:

SN | FD EXPRESSION | YOUR INTERPRETATION
01 | Student_ID → Student_Major | ____________
02 | Student_ID, CourseCode, Semester → Grade | ____________
03 | ____________ | Employee_Number functionally determines Current_Salary
04 | Row_ID → Movie_Title, Main_Actor | ____________
05 | ____________ | Country_Name is functionally dependent on the Lga_ID
06 | ____________ | Lecture_Room, NumberOfStudents and Lecturer (delivering the lecture) are functionally dependent on the Course_Code and Course_Section
07 | Car_Type, Maker_ID → Car_Price | ____________
08 | ____________ | Given a value for Car_ID, we can uniquely determine one value for Date_Sold, Price_Sold, Sales_Person, Buyer_ID




1.10: Extra Credit: Can You?

Consider R(empno, ename, deptno) with the following instance of R.

EMPNO      ENAME      DEPTNO
---------- ---------- ----------
7876       ADAMS      209
7499       TUNDE      301
7698       BUKAR      301
7600       BUKAR      405
7782       EMEKA      100
7902       LAWAL      209
7900       JAMES      301
7566       JAMIL      209




SECTION II: NORMALIZATION BY EXAMPLE

Before we begin, may I acknowledge that the bulk of this example on

normalization was adapted from the following file on the internet:

http://opencourseware.kfupm.edu.sa/colleges/cim/acctmis/mis311/files%5CChapter_5-_Data_Normalization_Topic_1_-Database_Tables_and_Normalization.pdf

2.1: Case-Study Introduction

To illustrate the normalization

process, we will examine a simple

business application. In this case we

will explore the simplified database

activities of a construction company

that manages several building

projects. Each project has its own

project number, name, employees

assigned to it and so on. Each

employee has an employee number,

name, and job classification such as

engineer or computer technician.

The company charges its clients by

billing the hours spent on each

contract. The hourly billing rate is

dependent on the employee’s

position. Periodically, a report is

generated that contains the

information displayed in Table 2.1.

Table 2.1: A sample report layout



FREE, NOT FOR SALE!

The Total Charge in Table 2.1 is a derived attribute and, at this point, is not stored in this table. The easiest shortcut to generating the required report might seem to be a table whose contents correspond exactly to the reporting requirements (see Fig. 2.1).

Fig. 2.1: A table in the report format.

Clearly, the structure of the data set in Fig. 2.1 does not handle data very well for

the following reasons:

1. The project number (PROJ_NUM) is apparently intended to be a primary key,

but it contains nulls.

2. The table entries invite data inconsistencies. For example, the JOB_CLASS value

“Elect.Engineer” might be entered as “Elect.Eng.” in some cases, “El. Eng” or

“EE” in others.

3. The table displays data redundancies. These data redundancies yield the

following anomalies:




a. Update anomalies. Modifying the JOB_CLASS for employee number 105 requires (potentially) many alterations, one for each row in which EMP_NUM = 105.

b. Insertion anomalies. Just to complete a row definition, a new employee must be assigned to a project. If the employee is not yet assigned, a phantom project must be created to complete the employee data entry.

c. Deletion anomalies. If employee 103 quits, deletions must be made for every entry in which EMP_NUM = 103. Such deletions will result in losing other vital data about project assignments from the database.

NOTE: In spite of these structural deficiencies, the table structure appears to work; the report is generated with ease. Unfortunately, the report may yield different results, depending on which data anomaly has occurred.
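The update and deletion anomalies can be reproduced on a tiny flat table. The following is a minimal Python sketch, assuming a cut-down version of the report format with only three columns; the rows and the helper name job_classes are hypothetical and are not taken from Fig. 2.1.

```python
# Hypothetical rows in the flat report format: (PROJ_NUM, EMP_NUM, JOB_CLASS).
flat = [
    (15, 103, "Elect. Engineer"),
    (15, 105, "Database Designer"),
    (18, 105, "Database Designer"),
    (25, 103, "Elect. Engineer"),
]


def job_classes(rows, emp_num):
    """Return every distinct JOB_CLASS recorded for one employee."""
    return {job for _, emp, job in rows if emp == emp_num}


# Update anomaly: the change reaches only one of the two rows for employee 105,
# so the table now holds two conflicting job classes for the same person.
partially_updated = list(flat)
partially_updated[1] = (15, 105, "Systems Analyst")
conflicting = job_classes(partially_updated, 105)

# Deletion anomaly: removing every row for employee 103 also removes the only
# row that mentioned project 25, so the project itself disappears.
after_delete = [row for row in flat if row[1] != 103]
lost_projects = {p for p, _, _ in flat} - {p for p, _, _ in after_delete}

print(sorted(conflicting))
print(sorted(lost_projects))
```

Both problems arise because each fact (an employee's job class, a project's existence) is stored once per assignment row rather than once in total.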

2.2: Conversion to First Normal Form

Fig. 2.1 contains what is known as

repeating groups. A repeating group

derives its name from the fact that a

group of multiple (related) entries

can exist for any single key attribute

occurrence.

A good relational table must not

contain repeating groups. The

existence of repeating groups

provides evidence that the

RPT_FORMAT table in Fig. 2.1 fails to

meet even the lowest normal form

requirements, thus reflecting data

redundancies.

Normalizing the table structure will

reduce these data redundancies. If

repeating groups do exist, they must

be eliminated by making sure that

each row defines a single entity. In

addition, the dependencies must be

identified to diagnose the normal

form. The normalization process

starts with a simple three-step

procedure.




2.2.1: Step 1: Eliminate the repeating groups

Start by presenting the data in a tabular format, where each cell has a single value and there are no repeating groups. To eliminate the repeating groups, eliminate the nulls by making sure that each repeating group attribute contains an appropriate data value. This change converts the RPT_FORMAT table in Fig. 2.1 to the DATA_ORG_1NF table in Fig. 2.2.

Fig. 2.2: Data organization: first normal form.
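Filling in the nulls of a repeating group amounts to copying each group's leading values down into the blank cells beneath them. The following is a minimal Python sketch of that step, assuming blank cells are represented by None; the sample rows, project names, and the helper name fill_down are hypothetical.

```python
# Hypothetical rows in the report layout: PROJ_NUM and PROJ_NAME appear only on
# the first row of each repeating group; None marks the blank cells.
report_rows = [
    {"PROJ_NUM": 15, "PROJ_NAME": "Alpha", "EMP_NUM": 103, "HOURS": 23.8},
    {"PROJ_NUM": None, "PROJ_NAME": None, "EMP_NUM": 101, "HOURS": 19.4},
    {"PROJ_NUM": None, "PROJ_NAME": None, "EMP_NUM": 105, "HOURS": 35.7},
    {"PROJ_NUM": 18, "PROJ_NAME": "Beta", "EMP_NUM": 114, "HOURS": 24.6},
    {"PROJ_NUM": None, "PROJ_NAME": None, "EMP_NUM": 118, "HOURS": 45.3},
]


def fill_down(rows, columns):
    """Copy the last non-null value of each listed column into blank cells."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for col in columns:
            if new_row[col] is None:
                new_row[col] = last_seen.get(col)
            else:
                last_seen[col] = new_row[col]
        filled.append(new_row)
    return filled


rows_1nf = fill_down(report_rows, ["PROJ_NUM", "PROJ_NAME"])
for row in rows_1nf:
    print(row)
```

After this step every row carries its own project number and name, so each row can be read on its own without looking at the rows above it.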

2.2.2: Step 2: Identify the primary key

The layout in Fig. 2.2 represents

much more than a mere cosmetic

change. Even a casual observer will

note that PROJ_NUM is not an adequate

primary key because the project

number does not uniquely identify all

the remaining entity (row) attributes.

For example, the PROJ_NUM value 15

can identify any one of five

employees. To maintain a proper

primary key that will uniquely

identify any attribute value, the new

key must be composed of a

combination of PROJ_NUM and EMP_NUM.
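Whether a set of columns can serve as a key for the data at hand can be checked by counting how often each combination of values occurs. The following is a minimal Python sketch, assuming hypothetical (PROJ_NUM, EMP_NUM) pairs rather than the actual contents of Fig. 2.2; the helper name is_unique_key is also hypothetical.

```python
from collections import Counter

# Hypothetical 1NF rows reduced to the two candidate key columns.
rows = [
    {"PROJ_NUM": 15, "EMP_NUM": 103},
    {"PROJ_NUM": 15, "EMP_NUM": 101},
    {"PROJ_NUM": 15, "EMP_NUM": 105},
    {"PROJ_NUM": 18, "EMP_NUM": 114},
    {"PROJ_NUM": 18, "EMP_NUM": 105},
]


def is_unique_key(rows, key_columns):
    """Return True if every combination of key_columns occurs exactly once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return all(n == 1 for n in counts.values())


print(is_unique_key(rows, ["PROJ_NUM"]))             # project number alone
print(is_unique_key(rows, ["EMP_NUM"]))              # employee number alone
print(is_unique_key(rows, ["PROJ_NUM", "EMP_NUM"]))  # the composite key
```

Neither column is unique on its own here, because a project has several employees and an employee can work on several projects; only the combination identifies a row.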

2.2.3: Step 3: Identify all dependencies

Dependencies can be depicted with

the help of a diagram as shown in Fig.

2.3. Because such a diagram depicts

all the dependencies found within a

given table structure, it is known as a

dependency diagram. Dependency

diagrams are very helpful in getting a

bird’s-eye view of all the

relationships among a table’s

attributes, and their use makes it

much less likely that you might

overlook an important dependency.




Fig. 2.3: A dependency diagram: first normal form (1NF).

Notice the following dependency diagram features from Fig. 2.3:

1. The primary key attributes are bold, underlined, and shaded in a different

colour.

2. The arrows above the attributes indicate all desirable dependencies, that is,

dependencies that are based on the primary key. In this case, note that the

entity’s attributes are dependent on the combination of PROJ_NUM and

EMP_NUM.

3. The arrows below the dependency diagram indicate less-desirable

dependencies. Two types of such dependencies exist:

a. Partial dependencies. Dependencies based on only a part of a

composite primary key are called partial dependencies.

b. Transitive dependencies. A transitive dependency is a dependency of

one nonprime attribute on another nonprime attribute. The problem

with transitive dependencies is that they still yield data anomalies.




The first normal form (1NF)

describes the tabular format, shown

in Fig. 2.2, in which three conditions

are satisfied:

All the key attributes are

defined.

There are no repeating groups

in the table.

All attributes are dependent on

the primary key.

All relational tables satisfy the 1NF requirements. The problem with the 1NF table structure shown in Fig. 2.3 is that it contains both partial dependencies and a transitive dependency.

2.3: Conversion to Second Normal Form

The rule of conversion from 1NF format to 2NF format is: eliminate all partial dependencies from the 1NF format. The conversion from 1NF to 2NF format is done in two steps:

2.3.1: Step 1: Identify all the key components

Fortunately, the relational database

design can be improved easily by

converting the database into a format

known as the second normal form

(2NF). The 1NF-to-2NF conversion is

simple: Starting with the 1NF format

displayed in Fig. 2.3, you do the

following activity:

Eliminate partial dependencies from the 1NF format in Fig. 2.3. This step produces three tables from the original table shown in Fig. 2.2.

From Fig. 2.3, two partial dependencies exist:

1. PROJ_NAME depends on PROJ_NUM, and

2. EMP_NAME, JOB_CLASS, and CHG_HOUR depend on EMP_NUM.

To eliminate the existing two partial dependencies, write each component on a

separate line, and then write the original (composite) key on the last line.




PROJ_NUM

EMP_NUM

PROJ_NUM EMP_NUM

Each component will become the key in a new table. The original table is now

divided into three tables: PROJECT, EMPLOYEE, and ASSIGN.

2.3.2: Step 2: Identify the dependent attributes

Determine which attributes are dependent on which other attributes. The three

new tables, PROJECT, EMPLOYEE and ASSIGN, are described by:

PROJECT (PROJ_NUM, PROJ_NAME)

EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)

ASSIGN (PROJ_NUM, EMP_NUM, ASSIGN_HOURS)

The results of steps 1 and 2 are displayed in Fig. 2.4. At this point, most of the

anomalies have been eliminated.

Fig. 2.4: Second normal form (2NF) conversion results.
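The split into PROJECT, EMPLOYEE and ASSIGN can be pictured as three projections of the 1NF table, each with duplicate rows removed. The following is a minimal Python sketch under that view; every data value and the helper name project are hypothetical, and the hours column is called ASSIGN_HOURS throughout to match the ASSIGN table above.

```python
# Hypothetical 1NF rows keyed by (PROJ_NUM, EMP_NUM); all values are illustrative.
COLUMNS = ["PROJ_NUM", "PROJ_NAME", "EMP_NUM", "EMP_NAME",
           "JOB_CLASS", "CHG_HOUR", "ASSIGN_HOURS"]
DATA_ORG_1NF = [
    dict(zip(COLUMNS, row)) for row in [
        (15, "Alpha", 103, "Musa", "Elect. Engineer", 84.50, 23.8),
        (15, "Alpha", 105, "Ngozi", "Database Designer", 105.00, 19.4),
        (18, "Beta", 105, "Ngozi", "Database Designer", 105.00, 12.6),
        (18, "Beta", 114, "Bala", "Applications Designer", 48.10, 24.6),
    ]
]


def project(rows, cols):
    """Project rows onto cols, dropping duplicate combinations."""
    seen, result = set(), []
    for row in rows:
        values = tuple(row[c] for c in cols)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(cols, values)))
    return result


# Each key component from Step 1 takes along the attributes that depend on it.
PROJECT = project(DATA_ORG_1NF, ["PROJ_NUM", "PROJ_NAME"])
EMPLOYEE = project(DATA_ORG_1NF, ["EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"])
ASSIGN = project(DATA_ORG_1NF, ["PROJ_NUM", "EMP_NUM", "ASSIGN_HOURS"])

print(len(PROJECT), len(EMPLOYEE), len(ASSIGN))
```

Each project name and each employee's details are now stored once, while ASSIGN keeps one row per assignment; joining the three tables on their keys gives back the original rows.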




You will recall that a table is in

second normal form (2NF) if the

following two key conditions are

satisfied:

It is in 1NF

It includes no partial

dependencies; that is, no

attribute is dependent on only

a portion of the primary key.

Because a partial dependency can

exist only if a table’s primary key is

composed of several attributes, a

table whose primary key consists of

only a single attribute is

automatically in 2NF if it is in 1NF.

In the ASSIGN table, the attribute

ASSIGN_HOURS depends on both key

attributes of the composite primary

key EMP_NUM and PROJ_NUM. However,

Fig. 2.4 still shows a transitive

dependency as CHG_HOUR depends on

JOB_CLASS. This transitive

dependency can generate anomalies.

2.4: Conversion to Third Normal Form

The rule of conversion from 2NF

format to 3NF format is: eliminate all

transitive dependencies from the 2NF

format.

The data anomalies created by the database organization shown in Fig. 2.4 are easily eliminated by completing the following three steps:

2.4.1: Step 1: Identify each new determinant

For every transitive dependency, write its determinant as a PK for a new table. (A

determinant is any attribute whose value determines other values within a

row.) If you have three different transitive dependencies, you will have three

different determinants. Fig. 2.4 shows only one case of transitive dependency.

Therefore, write the determinant for this transitive dependency:

JOB_CLASS




2.4.2: Step 2: Identify the dependent attributes

Identify the attributes that are dependent on each determinant identified in Step

1 and identify the dependency. In this case, you write: JOB_CLASS → CHG_HOUR

Name the table to reflect its contents and function. In this case, JOB seems

appropriate.

2.4.3: Step 3: Remove the dependent attributes from transitive dependencies

Eliminate all the dependent attributes in the transitive relationship(s) from each of the tables shown in Fig. 2.4 that have such a transitive relationship.

Draw a new dependency diagram to show all the tables defined in Steps 1-3.

Check the new tables as well as the tables modified in Step 3 to make sure that each table has a determinant and that no table contains inappropriate dependencies (partial or transitive).

When Steps 1-3 have been completed, the resulting tables are as shown in Fig. 2.5.

Fig. 2.5: Third normal form (3NF) conversion results.
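The three steps can be traced on a small EMPLOYEE table. The following is a minimal Python sketch, assuming hypothetical employee rows and rates; it builds JOB from the determinant JOB_CLASS, removes CHG_HOUR from EMPLOYEE, and then looks a rate up through JOB. The helper name charge_rate is hypothetical.

```python
# Hypothetical 2NF EMPLOYEE rows: (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR).
EMPLOYEE_2NF = [
    (103, "Musa", "Elect. Engineer", 84.50),
    (105, "Ngozi", "Database Designer", 105.00),
    (114, "Bala", "Applications Designer", 48.10),
    (118, "Sani", "Database Designer", 105.00),
]

# Steps 1 and 2: JOB takes the determinant JOB_CLASS as its key and CHG_HOUR
# as the dependent attribute.
JOB = {}
for _, _, job_class, chg_hour in EMPLOYEE_2NF:
    # JOB_CLASS -> CHG_HOUR means there can be only one rate per job class.
    if JOB.setdefault(job_class, chg_hour) != chg_hour:
        raise ValueError(f"conflicting rates for {job_class}")

# Step 3: EMPLOYEE keeps JOB_CLASS (now a foreign key) but drops CHG_HOUR.
EMPLOYEE = [(num, name, job_class) for num, name, job_class, _ in EMPLOYEE_2NF]


def charge_rate(emp_num):
    """Look up an employee's hourly rate through the JOB table."""
    for num, _, job_class in EMPLOYEE:
        if num == emp_num:
            return JOB[job_class]
    raise KeyError(emp_num)


print(JOB)
print(charge_rate(118))
```

Because the rate for each job class is now stored in one place only, changing it requires a single update in JOB, however many employees hold that job.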




Recall that a table is in third normal form (3NF) if the following two conditions

are satisfied:

It is in 2NF

It contains no transitive dependencies

2.5: Improving the Design

The table structures have been refined to eliminate the troublesome initial partial and transitive dependencies. Normalization cannot, by itself, be relied on to produce good designs; it is valuable because its use helps eliminate data redundancies. The design has therefore been improved further in the following areas:

PK Assignment

Naming Conventions

Attribute Atomicity

Adding Attributes

Adding Relationships

Refining PKs

Maintaining Historical Accuracy

Using Derived Attributes

The enhancements are shown in the tables and dependency diagrams in Fig. 2.6.

Fig. 2.6: The completed database.




2.6: Limitations on System-Assigned Keys

A system-assigned primary key may not prevent confusing entries. The data entries in Table 2.2 are inappropriate because they duplicate existing records, yet there has been no violation of either entity integrity or referential integrity.

Table 2.2: Duplicate entries in the JOB table.

This “multiple duplicate records” problem was created when the JOB_CODE was

added to become the PK. In any case, if JOB_CODE is to be the PK, we still must

ensure the existence of unique values in the JOB_DESCRIPTION through the use of a

unique index.

Although our design meets the vital entity and referential requirements, there are

still some concerns the designer must address. The JOB_CODE attribute was

created and designated to be the JOB table’s primary key to ensure entity

integrity in the JOB table. The DBMS may be used to have the system assign the

PK values. However it is useful to remember that the JOB_CODE does not prevent

us from making the entries in the JOB table shown in Table 2.2.
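This behaviour is easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module, assuming a cut-down JOB table; the CHG_HOUR column, the index name JOB_DESC_UQ and the sample values are hypothetical. It shows that a system-assigned JOB_CODE accepts a duplicate description, and that a unique index on JOB_DESCRIPTION rejects it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JOB ("
    " JOB_CODE INTEGER PRIMARY KEY AUTOINCREMENT,"  # system-assigned PK
    " JOB_DESCRIPTION TEXT NOT NULL,"
    " CHG_HOUR REAL)"
)

insert = "INSERT INTO JOB (JOB_DESCRIPTION, CHG_HOUR) VALUES (?, ?)"
conn.execute(insert, ("Programmer", 35.75))
# The same job again: each row gets a fresh JOB_CODE, so entity integrity
# is not violated and the duplicate is accepted.
conn.execute(insert, ("Programmer", 35.75))
dup_count = conn.execute(
    "SELECT COUNT(*) FROM JOB WHERE JOB_DESCRIPTION = 'Programmer'"
).fetchone()[0]

# Remove the duplicate, then guard the description with a unique index.
conn.execute(
    "DELETE FROM JOB WHERE JOB_CODE NOT IN "
    "(SELECT MIN(JOB_CODE) FROM JOB GROUP BY JOB_DESCRIPTION)"
)
conn.execute("CREATE UNIQUE INDEX JOB_DESC_UQ ON JOB (JOB_DESCRIPTION)")

try:
    conn.execute(insert, ("Programmer", 35.75))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

final_count = conn.execute("SELECT COUNT(*) FROM JOB").fetchone()[0]
print(dup_count, rejected, final_count)
```

Note that a unique index only catches exact duplicates; variant spellings of the same job description would still be accepted as different values.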

It is worth repeating that database design often involves trade-offs and the

exercise of professional judgment. In a real-world environment, we must strike a

balance between design integrity and flexibility.




SECTION III: NORMALIZATION EXTRA CREDITS (CAN YOU?)

3.1: Extra Credit 1: Worked Example

Examine the table shown below:

branchNo  branchAddress                          telNos
B001      8 Jefferson Way, Portland, OR 97201    503-555-3618, 503-555-2727, 503-555-6534
B002      City Center Plaza, Seattle, WA 98122   206-555-6756, 206-555-8836
B003      14 – 8th Avenue, New York, NY 10012    212-371-3000
B004      16 – 14th Avenue, Seattle, WA 98128    206-555-3131, 206-555-4112

(a) Why is this table not in 1NF?

(b) Describe and illustrate the process of normalizing the data shown in this

table to third normal form (3NF).

(c) Identify the primary, alternate and foreign keys in your 3NF relations.

Answers:




NOTE: There is an alternative approach to altering the original Branch table –

columns can be added to the original Branch table to hold the individual values

for each telephone number, e.g. telNo1, telNo2 and telNo3.
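The other common route is to move the multi-valued telNos column into a relation of its own, with one row per telephone number. The following is a minimal Python sketch of that split, offered as one possible reading of the exercise rather than as the official answer; the relation name BranchTel is hypothetical.

```python
# The un-normalized table from the exercise: telNos holds several values per cell.
branch_unf = [
    ("B001", "8 Jefferson Way, Portland, OR 97201",
     "503-555-3618, 503-555-2727, 503-555-6534"),
    ("B002", "City Center Plaza, Seattle, WA 98122",
     "206-555-6756, 206-555-8836"),
    ("B003", "14 – 8th Avenue, New York, NY 10012",
     "212-371-3000"),
    ("B004", "16 – 14th Avenue, Seattle, WA 98128",
     "206-555-3131, 206-555-4112"),
]

# Branch(branchNo, branchAddress): one row per branch.
branch = [(branch_no, address) for branch_no, address, _ in branch_unf]

# BranchTel(branchNo, telNo): one row per telephone number, so every cell
# holds a single value; branchNo refers back to Branch.
branch_tel = [
    (branch_no, tel.strip())
    for branch_no, _, tel_nos in branch_unf
    for tel in tel_nos.split(",")
]

for row in branch_tel:
    print(row)
```

Unlike the telNo1, telNo2, telNo3 approach, this layout needs no structural change when a branch gains a fourth telephone number, and it leaves no empty columns for branches with fewer numbers.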





Given the dependency diagram shown in the following figure, identify and discuss

each of the indicated dependencies:

Answers:

C1 → C2 represents a partial dependency, because C2 depends only on C1,

rather than on the entire primary key composed of C1 and C3.

C4 → C5 represents a transitive dependency, because C5 depends on an

attribute (C4) that is not part of a primary key.

C1, C3 → C2, C4, C5 represents a functional dependency, because C2, C4,

and C5 depend on the primary key composed of C1 and C3.
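The reasoning in these answers follows a simple rule based on how the determinant relates to the primary key. The following is a minimal Python sketch of that rule, assuming the primary key is the table's only candidate key; the function name classify and its labels are hypothetical.

```python
def classify(determinant, primary_key):
    """Label a dependency by comparing its determinant with the primary key."""
    det, pk = set(determinant), set(primary_key)
    if det == pk:
        return "full"        # depends on the whole primary key
    if det < pk:
        return "partial"     # depends on only part of a composite key
    if det.isdisjoint(pk):
        return "transitive"  # determinant is a non-key attribute
    return "other"           # mixes key and non-key attributes


PRIMARY_KEY = {"C1", "C3"}

print(classify({"C1"}, PRIMARY_KEY))        # C1 -> C2
print(classify({"C4"}, PRIMARY_KEY))        # C4 -> C5
print(classify({"C1", "C3"}, PRIMARY_KEY))  # C1, C3 -> C2, C4, C5
```

The rule looks only at the left-hand side of each dependency, which is why C1 → C2 and C4 → C5 receive different labels even though both involve a single determinant.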


Given the health report card data below, work out the un-normalized, first, second

and third normal forms for the given data card (see next page, please).




HEALTH HISTORY REPORT

PET ID PET NAME PET TYPE PET AGE OWNER VISIT DATE PROCEDURE

246 ROVER DOG 12 SAM COOK JAN 13/2002 01 - RABIES VACCINATION

MAR 27/2002 10 - EXAMINE and TREAT WOUND

APR 02/2002 05 - HEART WORM TEST

298 SPOT DOG 2 TERRY KIM JAN 21/2002 08 - TETANUS VACCINATION

MAR 10/2002 05 - HEART WORM TEST

341 MORRIS CAT 4 SAM COOK JAN 23/2001 01 - RABIES VACCINATION

JAN 13/2002 01 - RABIES VACCINATION

519 TWEEDY BIRD 2 TERRY KIM APR 30/2002 20 - ANNUAL CHECK UP

APR 30/2002 12 - EYE WASH

Your answers here, please:

UNF:

1NF:

2NF:

3NF:

WORKED ANSWERS:

UNF:

Pet [ pet_id, pet_name, pet_type, pet_age, owner, ( visitdate, procedure_no, procedure_name ) ]

1NF:

Pet [ pet_id, pet_name, pet_type, pet_age, owner ]

Pet_Visit [ pet_id, visitdate, procedure_no, procedure_name ]

Note that a procedure may occur on multiple dates; therefore, visitdate is included as part of the key.

2NF:

Pet [ pet_id, pet_name, pet_type, pet_age, owner ]

Pet_Visit [ pet_id, visitdate, procedure_no ]

Procedure [ procedure_no, procedure_name ]

3NF:

same as 2NF
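The 3NF relations can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module and the data from the report card. The key choices are assumptions, since the key markings are not visible in the relations above: pet_id for Pet, procedure_no for Procedure, and the combination (pet_id, visitdate, procedure_no) for Pet_Visit. Dates are stored as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pet (
    pet_id INTEGER PRIMARY KEY,
    pet_name TEXT, pet_type TEXT, pet_age INTEGER, owner TEXT
);
CREATE TABLE "Procedure" (
    procedure_no INTEGER PRIMARY KEY,
    procedure_name TEXT
);
CREATE TABLE Pet_Visit (
    pet_id INTEGER REFERENCES Pet(pet_id),
    visitdate TEXT,
    procedure_no INTEGER REFERENCES "Procedure"(procedure_no),
    PRIMARY KEY (pet_id, visitdate, procedure_no)  -- assumed composite key
);
""")

conn.executemany("INSERT INTO Pet VALUES (?, ?, ?, ?, ?)", [
    (246, "ROVER", "DOG", 12, "SAM COOK"),
    (298, "SPOT", "DOG", 2, "TERRY KIM"),
    (341, "MORRIS", "CAT", 4, "SAM COOK"),
    (519, "TWEEDY", "BIRD", 2, "TERRY KIM"),
])
conn.executemany('INSERT INTO "Procedure" VALUES (?, ?)', [
    (1, "RABIES VACCINATION"), (5, "HEART WORM TEST"),
    (8, "TETANUS VACCINATION"), (10, "EXAMINE and TREAT WOUND"),
    (12, "EYE WASH"), (20, "ANNUAL CHECK UP"),
])
conn.executemany("INSERT INTO Pet_Visit VALUES (?, ?, ?)", [
    (246, "2002-01-13", 1), (246, "2002-03-27", 10), (246, "2002-04-02", 5),
    (298, "2002-01-21", 8), (298, "2002-03-10", 5),
    (341, "2001-01-23", 1), (341, "2002-01-13", 1),
    (519, "2002-04-30", 20), (519, "2002-04-30", 12),
])

# Rebuild the lines of the health history report by joining the three tables.
report = conn.execute("""
    SELECT p.pet_id, p.pet_name, v.visitdate, pr.procedure_no, pr.procedure_name
    FROM Pet AS p
    JOIN Pet_Visit AS v ON v.pet_id = p.pet_id
    JOIN "Procedure" AS pr ON pr.procedure_no = v.procedure_no
    ORDER BY p.pet_id, v.visitdate, pr.procedure_no
""").fetchall()

for line in report:
    print(line)
```

TWEEDY's two procedures on the same date are the reason procedure_no is treated as part of the Pet_Visit key here; with (pet_id, visitdate) alone, the second of those rows would be rejected.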




SECTION IV: USEFUL DATABASE TERMS AND DEFINITIONS

Terms and definitions are very important in any topic. The normalization process

has its share of terms. Here are a few that you should try to understand as you

continue your exploration of the interesting world of database management

systems even outside the scope of this manual:

WORD/TERM DEFINITION

Anomaly A deviation from the common rule, type, arrangement, or form; or an incongruity or inconsistency.

Attribute Describes a column in a table. It comes from relational algebra, where a column is an attribute and a row is a tuple.

Binary relationship A reciprocal set of relations between two things. In databases, the two things are tables, views (semi-permanent result sets), or temporary result sets.

Candidate key A unique key that you may choose as a primary key.

Column Describes a vertical element in a table. It comes from spreadsheets, where a column defines the vertical axis of data. A column is a single element of a data structure that is found in every row. Columns always have a value in a data structure when the structure constrains its creation to demand one. Databases let you allow or disallow a null value when you create a table (or structure).

Composite key A key that is made up of two or more columns. It is possible that this term can be applied to many different keys, and that it is interchangeable with a compound key. You will see compound key used more frequently.

Compound key A key that is made up of two or more columns. It is possible that this term can be applied to many different keys, and that it is interchangeable with a composite key. This is typically the more widely used word.

Data structure Describes the definition of a type of data, like an integer or string and the collection of a group of various pre-defined data types into a group. The latter is the best corollary to a record in a file system, or a row in a database. You can make a numerically indexed array of any base data type (often described as scalar or primitive), or compound data type, which effectively creates a 2-dimensional structure known as a database table.

Field Describes a column in a table. It comes from file systems, which predate databases. A field is either a positionally fixed or delimited element in a list of values. Fields are always found but may be null or empty values.

File system Describes the use of files as a data repository, where each file contains rows of data organized as data structures. Procedural programming languages access the files based on the programmer’s knowledge of their definitions, which is normally maintained in a definitional file or document.

Foreign key A key that maps to a value in a primary key list, where the list exists in the same or another table.

Functional dependency Means an attribute or column depends on exactly one other unique attribute or set of attributes. The unique attribute is a single column natural key chosen as a primary key, while the unique set of attributes is a compound natural key, likewise chosen as a primary key. You write the functional dependency: A → B Columns that have a mandatory reliance on another column or set of columns are typically a non-key column or set of columns on a primary key found in the current row. A foreign key column is also functionally dependent on a primary key in the current table or other table.

Key A column that contains a value that identifies or helps in conjunction with other key columns to identify a row as unique.

Many-to-many A non-specific relationship between two tables, where one row in one table may map to one to many rows in the other and vice versa. You map these two tables by using a third table that holds a foreign key from both in the same row. The third table is known as an association or translation table. Both of the original tables have a one to many relationship to the association table, and both relationships resolve through the association table.

N-ary relationship A non-specific relationship between three or more tables, where one row in one table may have a many-to-many-to-many relationship between one or both of the other tables. You map these three or more tables by using another table that holds a foreign key from all of them in the same row. The other table is known as an association or translation table. Typically, all of the original tables have a one-to-many relationship to the association table, and all relationships resolve through the association table.

Natural key A unique key that identifies a row of data, or instance of data. A natural key is automatically a candidate key that you may choose as a primary key. All other columns in the table should enjoy a direct and full functional dependency on the natural key. If you adopt a surrogate key for joins, the surrogate key plus the natural key should become a unique index to speed searches through the table.




Nominated key A unique key that you may choose as a primary key, and it is also known as a candidate key. The only subtle difference that I’ve found is that some people use nominated to indicate the candidate key they’ve tentatively chosen before making a final decision.

Non-key A column that contains a descriptive value that doesn’t identify or help identify a row as unique but provides a characteristic to a row of data. All non-key columns should have a full functional dependency on the natural key, or primary key.

Non-specific relationship A logical reciprocal set of relations between two things where no row in either set has a possible intersection with the other. A non-specific relationship is also known as a binary many-to-many relationship. These are logical relationships that convert to two physical relationships known as specific relationships. Specific relationships are either one-to-one or one-to-many binary relationships. Non-specific relationships are resolved by two one-to-many relationships and an association set. The association set holds rows of foreign keys that point respectively to both sets. Each row in the association table lets you resolve the relationship between a row in one and a row in the other through an INNER JOIN.

Object instance An object instance is a data set inserted into a defined object type. This can occur at runtime, or in the context of databases through an INSERT statement. An Oracle database may contain nested object instances when a column relies on an object type, which are known as standalone objects.

Object type An object type in the context of a database is a data structure, or the definition of a table. Definitions of tables are stored in the database catalog and built upon pre-existing data types. Some databases support User-Defined Types (UDT). Where UDTs are available, the data structure may use them when they’re defined before the object type. Object types are a generalization of tables and user-defined types in an Oracle database.

One-to-many A specific relationship between two tables, where one row in one table maps to one to many rows in another table. You map these two tables by using the primary key of one table as a foreign key in the other. This makes the table that holds the foreign key functionally dependent on the primary key in the other table. The one side of the relationship is always the independent row, and it always donates a copy of its primary key to the dependent row.

One-to-one A specific relationship between two tables, where one row in one table maps to one and only one row in another table. You map these two tables by using the primary key of one table as a foreign key in the other. This makes the table that holds the foreign key functionally dependent on the primary key in the other table. While a one-to-one relationship allows you to choose either as the independent row, it is important that you identify the business relationship of the two tables and make the primary task element the independent row. The independent row donates a copy of its primary key to the dependent row.

Partial dependency A partial dependency exists when the primary key is a compound key of two or more columns, and one or more columns depends on less than all of the columns in the compound primary key.

Primary key A candidate key that you choose to serve as the primary key.

Record Describes a horizontal element in a table. A record is a row of data, or an instance of a defined data structure. As such, it is a row inside a file.

Row Describes a horizontal element in a table. It comes from spreadsheets, where a row defines the horizontal axis of data. A row is also an instance of the data structure defined by a table definition, and the nested array of a structure inside an ordinary array.

Specific relationship A reciprocal set of relations between two things where one row in a result set finds one row in another result set. Another example is where one row in a result set finds many matches in another result set. These binary relationships are respectively one-to-one and one-to-many. Specified relationships have equijoin or non-equijoin resolution. The first matches values, like the process in a nested loop, and the second matches values through a range or inequality relationship. Equijoins typically have a primary and foreign key, and the one-side holds the primary key while the many side holds a foreign key. In the specialized case of a one-to-one relationship, you must choose which table holds the primary key that becomes a functional dependency as a foreign key in the other.

Superkey A column or set of columns that uniquely identifies each row in a table. A superkey may contain more columns than are strictly needed for uniqueness; a candidate key is a superkey with no unnecessary columns.

Surrogate key A key that identifies uniqueness for rows, like an automatic numbering sequence. It can be a superior solution to a natural key because you create indexes by using the surrogate key column followed by the natural key column(s). If you discover more about the domain later and need to add a column to the natural key, you need only drop the index and recreate it with the new list of columns.

Transitive dependency A column that depends on another column before relying on the primary key of the table. It may exist in tables with three or more columns that are in second normal form.

Tuple Describes a row in a table. It comes from relational algebra, where a column is an attribute and a row is a tuple.

Unique key A column or set of columns that uniquely identifies a row of data.

User-defined type A data type defined by the user in a schema (Oracle) or database (MySQL and Microsoft SQL Server).




WE’RE DONE FOR NOW, GOOD BYE!

WELL, that will be all in this gentle introduction to database normalization. I hope you found it both useful and enjoyable, even though we had to skip many of the "noisy" parts that deal with the bewildering math and equations of normalization. In the next manuals for this course, we shall learn about database programming with VB6 and you will eventually build a complete data-driven desktop application to manage students' records at IACC, ABU Zaria. See you then...

Download more resources at: http://www.auwalgene.com/mystudents/lecturenotes

Connect with me on Facebook at: http://www.facebook.com/auwalgene3

Drop me a comment or two on my website: http://www.auwalgene.com/comments

Tell me something via e-mail at: [email protected]

Send SMS text messages to my mobile phone at +234 (0) 8032126160

Thank you for reading, and happy database normalization!
