normalization - texas southern...
Normalization
We discuss four normal forms: first, second, third, and
Boyce-Codd (1NF, 2NF, 3NF, and BCNF).
Normalization is a process that “improves” a database
design by generating relations that are of higher normal
forms.
The objective of normalization:
“to create relations where every dependency is on the key,
the whole key, and nothing but the key”.
There is a sequence to normal forms:
1NF is considered the weakest,
2NF is stronger than 1NF,
3NF is stronger than 2NF, and
BCNF is considered the strongest
Also,
any relation that is in BCNF, is in 3NF;
any relation in 3NF is in 2NF; and
any relation in 2NF is in 1NF.
Prime and non-prime Attributes
• In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation.
• The constituent attributes of a candidate key are called prime attributes or key attributes.
• Conversely, an attribute that does not occur in ANY candidate key is called a non-prime attribute or non-key attribute
• 2NF (and 3NF) both involve the concepts of
key and non-key attributes.
March 2017
[Figure: nested sets, with BCNF inside 3NF, 3NF inside 2NF, and 2NF inside 1NF]
a relation in BCNF is also in 3NF
a relation in 3NF is also in 2NF
a relation in 2NF is also in 1NF
We consider a relation in BCNF to be fully normalized.
The benefit of higher normal forms is that update semantics for
the affected data are simplified.
This means that applications required to maintain the database
are simpler.
A design that has a lower normal form than another design has
more redundancy. Uncontrolled redundancy can lead to data
integrity problems.
First we introduce the concept of functional dependency
Functional Dependencies
We say an attribute, B, is functionally dependent on
another attribute, A, if any two records that have
the same value for A must also have the same value
for B. We illustrate this as:
A → B
Example: Suppose we keep track of employee email
addresses, and we only track one email address for each
employee. Suppose each employee is identified by their
unique employee number. We say there is a functional
dependency of email address on employee number:
employee number → email address
EmpNum EmpEmail EmpFname EmpLname
123 [email protected] John Doe
456 [email protected] Peter Smith
555 [email protected] Alan Lee
633 [email protected] Peter Doe
787 [email protected] Alan Lee
If EmpNum is the PK then the FDs:
EmpNum → EmpEmail
EmpNum → EmpFname
EmpNum → EmpLname
must exist.
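Three different ways you might see FDs depicted:
1. One FD per line:
EmpNum → EmpEmail
EmpNum → EmpFname
EmpNum → EmpLname
2. One determinant with several dependents:
EmpNum → EmpEmail, EmpFname, EmpLname
3. [Figure: arrows drawn from EmpNum to the other attributes over the relation heading]
EmpNum EmpEmail EmpFname EmpLname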
Functional Dependency
Teacher Course Text
Smith Data Structures Bartram
Smith Data Management Martin
Hall Compilers Hoffman
Brown Data Structures Horowitz
Text → Course
{Teacher, Course} → Text
Course is NOT functionally dependent on Teacher
Text is NOT functionally dependent on Teacher
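The four claims about the Teacher/Course/Text table can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the helper name fd_holds is an assumption made for illustration. Keep in mind that a sample of rows can only refute a functional dependency. Whether a dependency truly holds is decided by the business rules, not by one snapshot of the data.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is not violated by rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; any later
        # disagreement means the dependency is violated.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Rows copied from the Teacher/Course/Text table above.
rows = [
    {"Teacher": "Smith", "Course": "Data Structures", "Text": "Bartram"},
    {"Teacher": "Smith", "Course": "Data Management", "Text": "Martin"},
    {"Teacher": "Hall", "Course": "Compilers", "Text": "Hoffman"},
    {"Teacher": "Brown", "Course": "Data Structures", "Text": "Horowitz"},
]

print(fd_holds(rows, ["Text"], ["Course"]))             # True
print(fd_holds(rows, ["Teacher", "Course"], ["Text"]))  # True
print(fd_holds(rows, ["Teacher"], ["Course"]))          # False
print(fd_holds(rows, ["Teacher"], ["Text"]))            # False
```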
Full Functional Dependency
The attribute B is fully functionally dependent on the attribute A if each value of A determines one and only one value of B.
• Example: PROJ_NUM → PROJ_NAME
In this case, the attribute PROJ_NUM is known as the determinant attribute and the attribute PROJ_NAME is known as the dependent attribute.
Determinant
Functional Dependency
EmpNum → EmpEmail
Attribute on the LHS is known as the determinant
• EmpNum is a determinant of EmpEmail
Attribute A determines attribute B (that is B is functionally
dependent on A): if all of the rows in the table that agree in value
for attribute A also agree in value for attribute B.
Fully functional dependency & composite key
• A composite or multipart key A is a combination of two or more columns in a table that can be used to uniquely identify each row in the table.
• If attribute B is functionally dependent on a composite key A but not on any subset of that composite key, the attribute B is fully functionally dependent on A.
Transitive dependency
Consider attributes A, B, and C, where
A → B and B → C.
Functional dependencies are transitive, which means
that we also have the functional dependency A → C.
We say that C is transitively dependent on A through B.
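Transitivity is what the standard attribute-closure computation exploits. The sketch below is an illustration added here, not code from the slides; the function name closure and the representation of each FD as a pair of Python sets are assumptions.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If everything on the left is already derivable, so is the right.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(closure({"A"}, fds)))  # ['A', 'B', 'C']
print(sorted(closure({"B"}, fds)))  # ['B', 'C']
```

Because C appears in the closure of A, the dependency A → C follows from A → B and B → C.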
EmpNum EmpEmail DeptNum DeptName
EmpNum → DeptNum
DeptNum → DeptName
EmpNum → DeptName
DeptName is transitively dependent on EmpNum via DeptNum.
Partial dependency
A partial dependency exists when an attribute B is
functionally dependent on an attribute A, and A is a
component of a multipart candidate key.
InvNum LineNum Qty InvDate
Candidate key: {InvNum, LineNum}
InvDate is partially dependent on {InvNum, LineNum}, because
InvNum is a determinant of InvDate and InvNum is
part of a candidate key.
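A partial dependency can be spotted by comparing each determinant with the candidate key. The following Python sketch is an added illustration under simplifying assumptions: the relation has a single candidate key, FDs are given as pairs of sets, and the helper name partial_dependencies is hypothetical. The dependency {InvNum, LineNum} → Qty is listed because the candidate key determines every attribute.

```python
def partial_dependencies(candidate_key, fds):
    """List FDs whose determinant is a proper subset of the candidate key
    and that determine at least one attribute outside the key."""
    key = set(candidate_key)
    found = []
    for lhs, rhs in fds:
        outside_key = set(rhs) - key
        if set(lhs) < key and outside_key:  # '<' means proper subset
            found.append((sorted(lhs), sorted(outside_key)))
    return found

fds = [
    ({"InvNum", "LineNum"}, {"Qty"}),
    ({"InvNum"}, {"InvDate"}),
]
print(partial_dependencies({"InvNum", "LineNum"}, fds))
# [(['InvNum'], ['InvDate'])]
```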
First Normal Form
We say a relation is in 1NF if all values stored in the
relation are single-valued and atomic.
1NF places restrictions on the structure of relations.
Values must be simple.
First normal form (1NF)
A relation (table) should be “flat”.
• the domain of an attribute must include only atomic (simple, indivisible) values and
• the value of any attribute in a tuple must be a single value from the domain of that attribute.
• The only attribute values permitted by 1NF are single atomic (or indivisible) values.
• (No relation within relation)
The following is not in 1NF:
EmpNum EmpPhone EmpDegrees
123 233-9876
333 233-1231 BA, BSc, PhD
679 233-1231 BSc, MSc
EmpDegrees is a multi-valued field:
employee 679 has two degrees: BSc and MSc
employee 333 has three degrees: BA, BSc, PhD
To obtain 1NF relations we must, without loss of
information, replace the above with two relations:
Employee
EmpNum EmpPhone
123 233-9876
333 233-1231
679 233-1231
EmployeeDegree
EmpNum EmpDegree
333 BA
333 BSc
333 PhD
679 BSc
679 MSc
A join between Employee and EmployeeDegree will produce
the information we saw before.
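The split into two relations, and the join that recovers the original information, can be sketched in Python. This is an added illustration, not the slides' own code; representing the empty EmpDegrees cell of employee 123 as an empty string and the helper name rebuild are assumptions.

```python
# The non-1NF table: EmpDegrees holds a comma-separated list.
employees_raw = [
    {"EmpNum": 123, "EmpPhone": "233-9876", "EmpDegrees": ""},
    {"EmpNum": 333, "EmpPhone": "233-1231", "EmpDegrees": "BA, BSc, PhD"},
    {"EmpNum": 679, "EmpPhone": "233-1231", "EmpDegrees": "BSc, MSc"},
]

# Employee(EmpNum, EmpPhone): one row per employee.
employee = [{"EmpNum": r["EmpNum"], "EmpPhone": r["EmpPhone"]} for r in employees_raw]

# EmployeeDegree(EmpNum, EmpDegree): one row per (employee, degree) pair.
employee_degree = [
    {"EmpNum": r["EmpNum"], "EmpDegree": d.strip()}
    for r in employees_raw
    for d in r["EmpDegrees"].split(",")
    if d.strip()
]

def rebuild(employee, employee_degree):
    """Left-join Employee with EmployeeDegree and regroup the degrees."""
    degrees = {}
    for row in employee_degree:
        degrees.setdefault(row["EmpNum"], []).append(row["EmpDegree"])
    return [
        {**e, "EmpDegrees": ", ".join(degrees.get(e["EmpNum"], []))}
        for e in employee
    ]

print(len(employee), len(employee_degree))             # 3 5
print(rebuild(employee, employee_degree) == employees_raw)  # True
```

A left join is used so that employee 123, who has no degrees, is not lost.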
Boyce-Codd Normal Form
BCNF is defined very simply:
a relation is in BCNF if it is in 1NF and if every
determinant is a candidate key.
Usually BCNF is the target normalization.
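The BCNF condition can be tested from a list of functional dependencies. The sketch below is an added illustration with hypothetical helper names (closure, bcnf_violations). It uses the common textbook formulation that the determinant of every non-trivial FD must be a superkey, and it reuses the EmpNum/DeptNum/DeptName example from the transitive dependency slide.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(attributes, fds):
    """Return the non-trivial FDs whose determinant is not a superkey."""
    all_attrs = set(attributes)
    bad = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial dependency, always allowed
        if closure(lhs, fds) != all_attrs:
            bad.append((sorted(lhs), sorted(rhs)))
    return bad

attributes = {"EmpNum", "EmpEmail", "DeptNum", "DeptName"}
fds = [
    ({"EmpNum"}, {"EmpEmail"}),
    ({"EmpNum"}, {"DeptNum"}),
    ({"DeptNum"}, {"DeptName"}),
]
print(bcnf_violations(attributes, fds))  # [(['DeptNum'], ['DeptName'])]
```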
Second Normal Form
A relation is in 2NF if it is in 1NF, and every non-key attribute is
fully dependent on each candidate key. (That is, we don’t have
any partial functional dependency.)
A relation schema R is in 2NF if every nonprime attribute A in R
is fully functionally dependent on the primary key of R.
• Relations that are not in BCNF have data redundancies.
• A relation in 2NF will not have any partial dependencies.
Consider this InvLine table (in 1NF):
InvLine
LineNum ProdNum Qty InvNum InvDate
{InvNum, LineNum} → ProdNum, Qty
InvNum → InvDate
InvLine is not in 2NF, since there is a partial
dependency of InvDate on InvNum.
Since there is a determinant that is not a
candidate key, InvLine is not in BCNF.
Qty is a non-key attribute, and it is dependent on InvNum, but
also on ProdNum and LineNum.
InvLine is only in 1NF.
InvLine
LineNum ProdNum Qty InvNum InvDate
The above relation has redundancies: the invoice date is
repeated on each invoice line.
We can improve the database by decomposing the relation
into two relations:
LineNum ProdNum Qty InvNum
InvDate InvNum
Question: What is the highest normal form for these
relations? 2NF? 3NF? BCNF?
Is the following relation in 2NF?
inv_no line_no prod_no prod_desc qty
Is the following relation in 2NF?
EmployeeDept
ename ssn bdate address dnumber dname
Third Normal Form
• A relation R is in 3NF if the relation is in 1NF and all
determinants of non-key attributes are candidate keys.
That is, for any functional dependency X → A that holds
in R, either (a) X is a candidate key of R, or (b) A is a
prime attribute.
• This definition of 3NF differs from BCNF only in the
specification of non-key attributes - 3NF is weaker
than BCNF. (BCNF requires all determinants to be
candidate keys.)
• A relation in 3NF will not have any transitive
dependencies of non-key attribute on a candidate key
through another non-key attribute.
Consider this Employee relation:
EmpNum EmpName DeptNum DeptName
Candidate keys are? …
EmpName, DeptNum, and DeptName are non-key attributes.
DeptNum determines DeptName, a non-key attribute, and
DeptNum is not a candidate key.
Is the relation in 3NF? … no
Is the relation in 2NF? … yes
Is the relation in BCNF? … no
EmpNum EmpName DeptNum DeptName
We correct the situation by decomposing the original relation
into two 3NF relations. Note that the decomposition is lossless.
EmpNum EmpName DeptNum
DeptNum DeptName
Verify these two relations are in 3NF.
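Losslessness means that joining the two projections gives back exactly the original rows. The following Python sketch illustrates this on a small instance. The slides give no data for this relation, so the three sample rows are invented for illustration, and the helper names project and natural_join are assumptions.

```python
def project(rows, attrs):
    """Project rows onto attrs with set semantics (duplicates dropped)."""
    return {tuple((a, r[a]) for a in attrs) for r in rows}

def natural_join(left, right):
    """Natural join of two sets of rows, each row a tuple of (attr, value) pairs."""
    out = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            common = ld.keys() & rd.keys()
            if all(ld[a] == rd[a] for a in common):
                out.add(tuple(sorted({**ld, **rd}.items())))
    return out

# Hypothetical sample rows, not taken from the slides.
rows = [
    {"EmpNum": 1, "EmpName": "Ann", "DeptNum": 10, "DeptName": "Sales"},
    {"EmpNum": 2, "EmpName": "Bob", "DeptNum": 10, "DeptName": "Sales"},
    {"EmpNum": 3, "EmpName": "Cy", "DeptNum": 20, "DeptName": "Research"},
]

emp = project(rows, ["EmpNum", "EmpName", "DeptNum"])
dept = project(rows, ["DeptNum", "DeptName"])
original = {tuple(sorted(r.items())) for r in rows}

print(len(emp), len(dept))                 # 3 2
print(natural_join(emp, dept) == original)  # True
```

The department name is now stored once per department instead of once per employee, and no row is gained or lost by the join.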
student_no course_no instr_no
Instructor teaches one course only.
Student takes a course and has one instructor.
In 3NF, but not in BCNF:
{student_no, course_no} → instr_no
instr_no → course_no
Since we have instr_no → course_no, but instr_no is not a
candidate key, the relation is not in BCNF.
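The claim "in 3NF, but not in BCNF" can be verified by computing the candidate keys and applying both definitions. This Python sketch is an added illustration; the function names are assumptions, and the brute-force key search is only practical for small relations.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(attributes, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    attrs = sorted(attributes)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so not minimal
            if closure(s, fds) == set(attrs):
                keys.append(s)
    return keys

def is_bcnf(attributes, fds):
    """Every non-trivial FD must have a superkey as its determinant."""
    return all(rhs <= lhs or closure(lhs, fds) == set(attributes)
               for lhs, rhs in fds)

def is_3nf(attributes, fds):
    """Each FD needs a superkey on the left or only prime attributes on the right."""
    prime = set().union(*candidate_keys(attributes, fds))
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) == set(attributes)
        if not is_superkey and any(a not in prime for a in rhs - lhs):
            return False
    return True

attributes = {"student_no", "course_no", "instr_no"}
fds = [
    ({"student_no", "course_no"}, {"instr_no"}),
    ({"instr_no"}, {"course_no"}),
]
print([sorted(k) for k in candidate_keys(attributes, fds)])
# [['course_no', 'student_no'], ['instr_no', 'student_no']]
print(is_3nf(attributes, fds), is_bcnf(attributes, fds))  # True False
```

Every attribute is prime here, which is why the 3NF test passes even though instr_no is a determinant that is not a candidate key.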
student_no course_no instr_no
is decomposed into:
student_no instr_no
course_no instr_no
{student_no, instr_no} → student_no
{student_no, instr_no} → instr_no
instr_no → course_no
Is the following relation in 3NF? BCNF?
EmployeeDept
ename ssn bdate address dnumber dname
Neither 3NF nor BCNF,
since dnumber is not a candidate key and we have:
dnumber → dname.
LineNum ProdNum Qty InvNum InvDate
InvLine (original)
2NF
LineNum ProdNum Qty InvNum
InvDate InvNum
Question: What is the highest normal form for these
relations? 2NF? 3NF? BCNF?
InvNum ProdNum Qty
Conversion to First Normal Form
• A relational table must not contain repeating groups.
• A repeating group derives its name from the fact that a group of multiple entries of the same type can exist for any single key attribute occurrence.
• If repeating groups do exist, they must be eliminated by making sure that each row defines a single entity.
• 1NF starts with a simple three-step procedure:
Step 1: Eliminate the Repeating Groups:
• Represent the data in a tabular format, where each cell has a single value and there are no repeating groups.
• To eliminate repeating groups: eliminate the NULLs by making sure that each repeating group attribute contains an appropriate data value.
1NF violation
PROJ_NUM  PROJ_NAME     EMP_NUM  EMP_NAME              JOB_CLASS             CHG_HOUR  HOURS
15        Evergreen     103      June E. Arbough       Elect. Engineer       84.5      23.8
                        101      John G. News          Database Designer     105       19.4
                        105      Alice K. Johnson      Database Designer     105       35.7
                        106      William Smithfield    Programmer            35.75     12.6
                        102      David H. Senior       Systems Analyst       96.75     23.6
18        Amber Wave    114      Annelisse Jones       Application Designer  48.1      24.6
                        118      James J. Frommer      General Support       18.36     45.3
                        104      Anne. K. Romoras      Systems Analyst       96.75     32.4
                        112      Darlene M. Smithson   DSS Analyst           45.95     44
22        Rolling Tide  105      Alice K. Johnson      Database Designer     105       64.7
                        104      Anne. K. Romoras      Systems Analyst       96.75     48.4
                        113      Delbert K. Joenbrood  Application Designer  48.1      23.6
                        111      Geoff B. Wabash       Clerical Support      26.87     22
                        106      William Smithfield    Programmer            35.75     12.8
25        Starflight    107      Maria D. Alonzo       Programmer            35.75     24.6
                        115      Travis B. Bawangi     Systems Analyst       96.75     45.8
                        101      John G. News          Database Designer     105       56.3
                        114      Annelisse Jones       Application Designer  48.1      33.1
                        108      Ralph B. Washington   Systems Analyst       96.5      23.6
                        118      James J. Frommer      General Support       18.36     30.5
                        112      Darlene M. Smithson   DSS Analyst           45.95     41.4
Step 1: Eliminating Repetitions
PROJ_NUM PROJ_NAME EMP_NUM EMP_NAME JOB_CLASS CHG_HOUR HOURS
15 Evergreen 103 June E. Arbough Elect. Engineer 84.5 23.8
15 Evergreen 101 John G. News Database Designer 105 19.4
15 Evergreen 105 Alice K. Johnson Database Designer 105 35.7
15 Evergreen 106 William Smithfield Programmer 35.75 12.6
15 Evergreen 102 David H. Senior Systems Analyst 96.75 23.6
18 Amber Wave 114 Annelisse Jones Application Designer 48.1 24.6
18 Amber Wave 118 James J. Frommer General Support 18.36 45.3
18 Amber Wave 104 Anne. K. Romoras Systems Analyst 96.75 32.4
18 Amber Wave 112 Darlene M. Smithson DSS Analyst 45.95 44
22 Rolling Tide 105 Alice K. Johnson Database Designer 105 64.7
22 Rolling Tide 104 Anne. K. Romoras Systems Analyst 96.75 48.4
22 Rolling Tide 113 Delbert K. Joenbrood Application Designer 48.1 23.6
22 Rolling Tide 111 Geoff B. Wabash Clerical Support 26.87 22
22 Rolling Tide 106 William Smithfield Programmer 35.75 12.8
25 Starflight 107 Maria D. Alonzo Programmer 35.75 24.6
25 Starflight 115 Travis B. Bawangi Systems Analyst 96.75 45.8
25 Starflight 101 John G. News Database Designer 105 56.3
25 Starflight 114 Annelisse Jones Application Designer 48.1 33.1
25 Starflight 108 Ralph B. Washington Systems Analyst 96.5 23.6
25 Starflight 118 James J. Frommer General Support 18.36 30.5
25 Starflight 112 Darlene M. Smithson DSS Analyst 45.95 41.4
Step 2: Identify the Primary Key:
• A proper primary key must uniquely identify every attribute value in a row.
• The PROJ_NUM value 15 identifies any one of five employees, so PROJ_NUM alone is not a key.
• EMP_NUM can also identify multiple rows, since one employee can work on more than one project.
• In this case, the only primary key possible is a combination of PROJ_NUM and EMP_NUM.
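The reasoning above amounts to a uniqueness test on the data. The following Python sketch is an added illustration; the helper name identifies_rows is an assumption, and only five of the (PROJ_NUM, EMP_NUM) pairs from the table are used.

```python
def identifies_rows(rows, attrs):
    """True if no two rows share the same values for attrs."""
    keys = [tuple(row[a] for a in attrs) for row in rows]
    return len(keys) == len(set(keys))

# A few (PROJ_NUM, EMP_NUM) pairs taken from the table above.
rows = [
    {"PROJ_NUM": 15, "EMP_NUM": 103},
    {"PROJ_NUM": 15, "EMP_NUM": 101},
    {"PROJ_NUM": 15, "EMP_NUM": 105},
    {"PROJ_NUM": 22, "EMP_NUM": 105},
    {"PROJ_NUM": 25, "EMP_NUM": 101},
]

print(identifies_rows(rows, ["PROJ_NUM"]))             # False
print(identifies_rows(rows, ["EMP_NUM"]))              # False
print(identifies_rows(rows, ["PROJ_NUM", "EMP_NUM"]))  # True
```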
PROJECT(PROJ_NUM, EMP_NUM, PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS)
Step 3: Identify all dependencies:
• (PROJ_NUM, EMP_NUM) → PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS
• Additional dependencies:
• PROJ_NUM → PROJ_NAME
• EMP_NUM → EMP_NAME, JOB_CLASS, CHG_HOUR
• JOB_CLASS → CHG_HOUR
This last dependency exists between two nonprime attributes, which signals a transitive dependency.
Conversion to Second Normal Form
• Conversion to 2NF is needed only when the 1NF table has a composite primary key.
• If the 1NF table has a single-attribute primary key, then the table is automatically in 2NF.
• 2NF starts with a simple two-step procedure:
Step 1: Make new tables to Eliminate Partial Dependencies
• For each component of the primary key that acts as a determinant in a partial dependency, create a new table with a copy of that component as the primary key.
• The determinant attributes must also remain in the original table, because they will be the foreign keys that relate the new tables to the original one.
New table keys: PROJ_NUM, EMP_NUM
Step 2: Reassign Corresponding Dependent Attributes
• Determine all attributes that are dependent in the partial dependencies. These are removed from the original table and placed in the new table with their determinant.
• Any attributes that are not dependent in a partial dependency remain in the original table.
PROJ_NUM → PROJ_NAME
PROJ_NUM PROJ_NAME
15 Evergreen
18 Amber Wave
22 Rolling Tide
25 Starflight
Step 1: 2NF
EMP_NUM → EMP_NAME, JOB_CLASS, CHG_HOUR
EMP_NUM EMP_NAME JOB_CLASS CHG_HOUR
103 June E. Arbough Elect. Engineer 84.5
101 John G. News Database Designer 105
105 Alice K. Johnson Database Designer 105
106 William Smithfield Programmer 35.75
102 David H. Senior Systems Analyst 96.75
114 Annelisse Jones Application Designer 48.1
118 James J. Frommer General Support 18.36
104 Anne. K. Romoras Systems Analyst 96.75
112 Darlene M. Smithson DSS Analyst 45.95
105 Alice K. Johnson Database Designer 105
104 Anne. K. Romoras Systems Analyst 96.75
113 Delbert K. Joenbrood Application Designer 48.1
111 Geoff B. Wabash Clerical Support 26.87
106 William Smithfield Programmer 35.75
107 Maria D. Alonzo Programmer 35.75
115 Travis B. Bawangi Systems Analyst 96.75
101 John G. News Database Designer 105
114 Annelisse Jones Application Designer 48.1
108 Ralph B. Washington Systems Analyst 96.5
118 James J. Frommer General Support 18.36
112 Darlene M. Smithson DSS Analyst 45.95
PROJ_NUM EMP_NUM ASSIGN_HOURS
15 103 23.8
15 101 19.4
15 105 35.7
15 106 12.6
15 102 23.6
18 114 24.6
18 118 45.3
18 104 32.4
18 112 44
22 105 64.7
22 104 48.4
22 113 23.6
22 111 22
22 106 12.8
25 107 24.6
25 115 45.8
25 101 56.3
25 114 33.1
25 108 23.6
25 118 30.5
25 112 41.4
ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
2NF Form
Now, we have 3 tables:
• PROJECT(PROJ_NUM, PROJ_NAME)
• EMPLOYEE(EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)
• ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
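The three 2NF tables are projections of the 1NF table. The following Python sketch shows the idea on three of the rows. It is an added illustration, not the slides' own procedure; the helper name project and the variable names are assumptions.

```python
def project(rows, attrs):
    """Project rows onto attrs, keeping the first copy of each distinct tuple."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            out.append(dict(zip(attrs, values)))
    return out

cols = ["PROJ_NUM", "PROJ_NAME", "EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR", "HOURS"]
# Three rows copied from the 1NF table.
data = [
    (15, "Evergreen", 103, "June E. Arbough", "Elect. Engineer", 84.5, 23.8),
    (15, "Evergreen", 101, "John G. News", "Database Designer", 105, 19.4),
    (25, "Starflight", 101, "John G. News", "Database Designer", 105, 56.3),
]
rows = [dict(zip(cols, d)) for d in data]

project_tbl = project(rows, ["PROJ_NUM", "PROJ_NAME"])
employee_tbl = project(rows, ["EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"])
# HOURS depends on the whole key, so it stays with (PROJ_NUM, EMP_NUM).
assignment_tbl = [
    {"PROJ_NUM": r["PROJ_NUM"], "EMP_NUM": r["EMP_NUM"], "ASSIGN_HOURS": r["HOURS"]}
    for r in rows
]

print(len(project_tbl), len(employee_tbl), len(assignment_tbl))  # 2 2 3
```

Project 15 and employee 101 each appear twice in the 1NF rows but only once in their own table, which is the redundancy that 2NF removes.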
Conversion to Third Normal Form
• Step 1: Make new tables to eliminate transitive dependencies.
• For every transitive dependency, write a copy of its determinant as a primary key for a new table.
• It is also important that the determinant remains in the original table to serve as a foreign key.
JOB_CLASS → CHG_HOUR
New table key: JOB_CLASS
Step 2: 3NF
• Identify the attributes that are dependent on each determinant and place them in the new tables with their determinant and remove them from their original table.
In our example:
1. Move CHG_HOUR to new table
2. Remove CHG_HOUR from EMPLOYEE
3. Now: EMP_NUM → EMP_NAME, JOB_CLASS
Step 2: 3NF
JOB_CLASS CHG_HOUR
General Support 18.36
Clerical Support 26.87
Programmer 35.75
DSS Analyst 45.95
Application Designer 48.1
Elect. Engineer 84.5
Systems Analyst 96.5
Systems Analyst 96.75
Database Designer 105
BCNF by making JOB_CLASS a key
JOB_CLASS CHG_HOUR
Elect. Engineer 84.5
Database Designer 105
Programmer 35.75
Systems Analyst 96.75
Application Designer 48.1
General Support 18.36
DSS Analyst 45.95
Clerical Support 26.87
Mechanical Engineer 67.9
Civil Engineer 55.78
Bio Technician 34.55
Step 2: 3NF
EMP_NUM EMP_NAME JOB_CLASS
103 June E. Arbough Elect. Engineer
101 John G. News Database Designer
105 Alice K. Johnson Database Designer
106 William Smithfield Programmer
102 David H. Senior Systems Analyst
114 Annelisse Jones Application Designer
118 James J. Frommer General Support
104 Anne. K. Romoras Systems Analyst
112 Darlene M. Smithson DSS Analyst
113 Delbert K. Joenbrood Application Designer
111 Geoff B. Wabash Clerical Support
107 Maria D. Alonzo Programmer
115 Travis B. Bawangi Systems Analyst
108 Ralph B. Washington Systems Analyst
BCNF conversion
• So now our design becomes:
• PROJECT(PROJ_NUM, PROJ_NAME)
• EMPLOYEE(EMP_NUM, EMP_NAME, JOB_CLASS)
• JOB(JOB_CLASS, CHG_HOUR)
• ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
• Performance tip: Introduce a surrogate key to act as a numeric key in the JOB relation.
JOB_ID JOB_CLASS CHG_HOUR
500 Elect. Engineer 84.5
501 Database Designer 105
502 Programmer 35.75
503 Systems Analyst 96.75
504 Application Designer 48.1
505 General Support 18.36
506 DSS Analyst 45.95
507 Clerical Support 26.87
508 Mechanical Engineer 67.9
509 Civil Engineer 55.78
510 Bio Technician 34.55
After BCNF conversion
• So now our design becomes:
• PROJECT(PROJ_NUM, PROJ_NAME)
• EMPLOYEE(EMP_NUM, EMP_NAME, JOB_ID)
• JOB(JOB_ID, JOB_CLASS, CHG_HOUR)
• ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
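The final design can be tried out with Python's built-in sqlite3 module. The sketch below is an added illustration: the column types, the NOT NULL and UNIQUE constraints, and the use of an in-memory database are assumptions, since the slides give only relation names and attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE PROJECT (
    PROJ_NUM  INTEGER PRIMARY KEY,
    PROJ_NAME TEXT NOT NULL
);
CREATE TABLE JOB (
    JOB_ID    INTEGER PRIMARY KEY,
    JOB_CLASS TEXT NOT NULL UNIQUE,
    CHG_HOUR  REAL NOT NULL
);
CREATE TABLE EMPLOYEE (
    EMP_NUM  INTEGER PRIMARY KEY,
    EMP_NAME TEXT NOT NULL,
    JOB_ID   INTEGER NOT NULL REFERENCES JOB(JOB_ID)
);
CREATE TABLE ASSIGNMENT (
    PROJ_NUM     INTEGER NOT NULL REFERENCES PROJECT(PROJ_NUM),
    EMP_NUM      INTEGER NOT NULL REFERENCES EMPLOYEE(EMP_NUM),
    ASSIGN_HOURS REAL NOT NULL,
    PRIMARY KEY (PROJ_NUM, EMP_NUM)
);
""")

# One row per table, using values from the slides.
conn.execute("INSERT INTO PROJECT VALUES (15, 'Evergreen')")
conn.execute("INSERT INTO JOB VALUES (500, 'Elect. Engineer', 84.5)")
conn.execute("INSERT INTO EMPLOYEE VALUES (103, 'June E. Arbough', 500)")
conn.execute("INSERT INTO ASSIGNMENT VALUES (15, 103, 23.8)")

# Joining the four tables rebuilds a row of the original 1NF table.
query = """
SELECT p.PROJ_NAME, e.EMP_NAME, j.CHG_HOUR, a.ASSIGN_HOURS
FROM ASSIGNMENT a
JOIN PROJECT p ON p.PROJ_NUM = a.PROJ_NUM
JOIN EMPLOYEE e ON e.EMP_NUM = a.EMP_NUM
JOIN JOB j ON j.JOB_ID = e.JOB_ID
"""
print(conn.execute(query).fetchall())
# [('Evergreen', 'June E. Arbough', 84.5, 23.8)]
```

Each charge rate is stored once in JOB, so changing the rate for a job class is a single-row update.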
Summary of normal forms: test and remedy (normalization)
First (1NF)
Test: Relation should have no multivalued attributes or nested relations.
Remedy: Form new relations for each multivalued attribute or nested relation.
Second (2NF)
Test: For relations where the primary key contains multiple attributes, no nonkey attribute should be functionally dependent on a part of the primary key.
Remedy: Decompose and set up a new relation for each partial key with its dependent attribute(s). Make sure to keep a relation with the original primary key and any attributes that are fully functionally dependent on it.
Third (3NF)
Test: Relation should not have a nonkey attribute functionally determined by another nonkey attribute (or by a set of nonkey attributes). That is, there should be no transitive dependency of a nonkey attribute on the primary key.
Remedy: Decompose and set up a relation that includes the nonkey attribute(s) that functionally determine(s) other nonkey attribute(s).
BCNF
Test: Every determinant should be a candidate key.
Remedy: Decompose so that each determinant that is not a candidate key becomes the key of a new table.