www.infotech.monash.edu.au/FIT1004/
FIT1004 Database Topic 6: Normalisation
Learning Objectives:
• Understand the purpose of normalisation
• Understand the problems associated with redundant data
• Identify various types of update anomalies such as insertion, deletion, and modification anomalies
• Recognise the appropriateness or quality of the design of relations
• Identify various types of functional dependencies between attributes
• Understand how functional dependencies can be used to group attributes into relations that are in a known normal form
• Identify the most commonly used normal forms, namely 1NF, 2NF and 3NF
• Perform normalisation
• Understand various ways to refine 3NF relations to achieve better database design
• Produce an ER diagram from the derived set of 3NF relations
References:
• Rob, P. & Coronel, C., Database Systems, 6th Edition, Chapt. 5, p. 182 – 221; 7th Edition, Chapt. 5, p. 147 – 174
2
Where are we?
• Introduction to Database Systems
• The Relational Model
• Conceptual Design
• Logical Design
• Normalisation
• Database Lifecycle
• Physical Design
• SQL (DML)
• SQL (DDL & DCL)
• Implementation
• Transaction Management
• Database Administration
• Data Warehousing & Data Mining
3
Normalisation
• Normalisation is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise:
  – Developed by E.F. Codd (1972)
  – Often performed as a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form
• Four most commonly used normal forms are:
  – First (1NF),
  – Second (2NF),
  – Third (3NF) – normally sufficient point, and
  – Boyce-Codd (BCNF)
  – 4NF, … etc. (required by some very specialised applications)
• Based on functional dependencies among the attributes of a relation
• Major aim of relational database design is to group attributes into relations to minimise data redundancy and reduce file storage space required by base relations
4
Why Normalisation is required
Note * signifies Project Leader
5
Problems with table in Figure 5.1
• PROJ_NUM is intended to be the primary key, but it contains nulls
• JOB_CLASS invites entry errors, e.g. Elec. Eng. vs Elect. Engineer vs E.E.
• Project relation has redundant data
  – details of a charge per hour are repeated for every occurrence of a job class
  – every time an employee is assigned to a project, the employee name is repeated
• Relations that contain redundant information may potentially suffer from update anomalies
  – Types of update anomalies include:
    > Insertion
      – Insert a new employee only if they are assigned to a project
    > Deletion
      – Delete the last employee assigned to a project?
      – Delete the last employee of a particular job class?
    > Modification
      – Update a job class hourly rate - need to update multiple rows
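The three anomaly types can be made concrete with a small sketch. The rows below are hypothetical sample data (they are not the contents of Figure 5.1), and the helper names are invented for illustration; only the column names follow the slides.

```python
# Hypothetical rows: (proj_num, emp_num, emp_name, job_class, chg_hour)
rows = [
    (15, 101, "Ann", "Database Designer", 105.00),
    (15, 102, "Bob", "Programmer", 35.75),
    (18, 101, "Ann", "Database Designer", 105.00),
    (22, 103, "Cy", "Systems Analyst", 96.75),
]

def rows_to_update(rows, job_class):
    """Modification anomaly: a rate change must touch every row of the job class."""
    return sum(1 for r in rows if r[3] == job_class)

def known_rates(rows):
    """The only place a job class's rate is recorded is inside the assignment rows."""
    return {r[3]: r[4] for r in rows}

# Deletion anomaly: removing employee 103 (the only Systems Analyst)
# also removes the only record of the Systems Analyst hourly rate.
after_delete = [r for r in rows if r[1] != 103]

# Insertion anomaly: an employee with no project has no proj_num,
# so the intended key would have to be null.
new_employee = (None, 104, "Di", "Programmer", 35.75)

print(rows_to_update(rows, "Database Designer"))
print("Systems Analyst" in known_rates(after_delete))
print(new_employee[0] is None)
```

Each printed value corresponds to one anomaly: the number of rows a single rate change must touch, whether the deleted job class's rate is still known, and whether the new row lacks a key value.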
6
Functional Dependence
• An attribute B is FUNCTIONALLY DEPENDENT on another attribute A if a value of A determines a single value of B at any one time.
  – A → B
  – EMP# → EMP_NAME
  – CUSTNUMB → CUSTNAME
  – ORDER-NUMBER → ORDER-DATE
    > ORDER-NUMBER - independent variable, also known as the DETERMINANT
    > ORDER-DATE - dependent variable
• TOTAL DEPENDENCY – attribute A determines B AND attribute B determines A
    > EMPLOYEE-NUMBER ↔ TAX-FILE-NUMBER
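The definition above can be tested against sample rows with a short function. This is a minimal sketch: `fd_holds` and the sample data are hypothetical, and a check on sample data can only refute a dependency. Whether a dependency really holds is a statement about the business rules, not about one snapshot of the data.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if each determinant value maps to a single dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample data
employees = [
    {"emp_num": 101, "emp_name": "Ann", "job_class": "Programmer"},
    {"emp_num": 102, "emp_name": "Bob", "job_class": "Programmer"},
    {"emp_num": 103, "emp_name": "Ann", "job_class": "Systems Analyst"},
]

# emp_num -> emp_name holds in this sample
print(fd_holds(employees, ["emp_num"], ["emp_name"]))
# emp_name -> emp_num does not: two different employees are named Ann
print(fd_holds(employees, ["emp_name"], ["emp_num"]))
```

A total dependency would correspond to `fd_holds` returning True in both directions.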
7
Functional Dependence
• FULL DEPENDENCY – occurs when an attribute is dependent on ALL of a determinant made up of AT LEAST TWO attributes, and not on any subset of it
  – ORDER-NUMBER, PART-NUMBER → QTY-ORDERED
  – lack of full dependence for multiple attribute key = partial dependence
• TRANSITIVE DEPENDENCY
  – occurs when Y depends on X, and Z depends on Y - thus Z also depends on X
    > X → Y → Z
  – INVOICE-NUMB → CUSTOMER-NUMB → CUSTOMER-NAME
• Dependencies are depicted with the help of a Dependency Diagram
• NORMALISATION - SIMPLY 'COMMON SENSE'
• Converts a table into tables of progressively smaller degree and cardinality until an optimum level of decomposition is reached - little or no data redundancy exists
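Partial and transitive dependencies can be found mechanically once the functional dependencies are written down. The sketch below uses attribute closure, a standard technique that the slides do not name; the function names are hypothetical, and the dependency list is an assumption based on the ASSIGN relation used in this lecture.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(key, fds):
    """Non-key attributes determined by a proper subset of a composite key."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) - set(key)
            if determined:
                found[subset] = determined
    return found

def transitive_dependencies(key, fds):
    """Dependencies whose determinant contains no key attribute."""
    return [(lhs, rhs) for lhs, rhs in fds if not (lhs & set(key))]

# Assumed dependencies for ASSIGN (proj_num, emp_num, emp_name,
# job_class, chg_hour, assign_hours)
fds = [
    ({"proj_num", "emp_num"}, {"assign_hours"}),
    ({"emp_num"}, {"emp_name", "job_class", "chg_hour"}),
    ({"job_class"}, {"chg_hour"}),
]
key = {"proj_num", "emp_num"}

print(partial_dependencies(key, fds))     # attributes that depend on emp_num alone
print(transitive_dependencies(key, fds))  # job_class -> chg_hour
```

Only `assign_hours` is fully dependent on the composite key here, which is why it is the only non-key attribute that stays with that key when the partial dependencies are removed.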
8
First Normal Form
• Positive results from normalisation -
  – amount of space needed to store data will be lower
  – table can be updated with greater efficiency
  – description of database will be straightforward
• Unnormalised form (UNF) – raw data from table/form/grid
• UNF: PROJECT (proj_num, proj_name (emp_num, emp_name, ….))
  – Figure 5.1 consists of a set of projects with each project having a set of project-employee details (model 1)
• FIRST NORMAL FORM (part of formal definition of relation)
  – A TABLE IS IN FIRST NORMAL FORM (1NF) IF -
    > it is a valid table (in particular no repeating groups)
    > a unique key has been identified for each row
    > all attributes are functionally dependent on all or part of the key
  – 1NF: PROJECT (proj_num, proj_name)
  – 1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours)
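The move from the nested UNF structure to the two 1NF relations can be sketched as follows. The data values and the `to_1nf` helper are hypothetical, and only a few of the ASSIGN attributes are shown to keep the example short.

```python
# Hypothetical UNF data: each project carries a repeating group of employees.
unf_projects = [
    {"proj_num": 15, "proj_name": "Alpha",
     "employees": [
         {"emp_num": 101, "emp_name": "Ann", "assign_hours": 12.5},
         {"emp_num": 102, "emp_name": "Bob", "assign_hours": 8.0},
     ]},
    {"proj_num": 18, "proj_name": "Beta",
     "employees": [
         {"emp_num": 101, "emp_name": "Ann", "assign_hours": 4.0},
     ]},
]

def to_1nf(unf):
    """Split the repeating group into its own relation, carrying the project PK."""
    project, assign = [], []
    for p in unf:
        project.append({"proj_num": p["proj_num"], "proj_name": p["proj_name"]})
        for e in p["employees"]:
            # Each extracted row keeps proj_num so it can be matched to its project
            assign.append({"proj_num": p["proj_num"], **e})
    return project, assign

project, assign = to_1nf(unf_projects)
for row in assign:
    print(row)
```

Every cell in `project` and `assign` now holds a single value, and (proj_num, emp_num) identifies each ASSIGN row.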
9
UNF to 1NF transformation
• Identify the repeating group(s), if any, in the unnormalised relation
• Move from UNF to 1NF by removing repeating group along with the PK of the main relation
• Important property of normalisation decomposition
  – Lossless-join property enables us to find any instance of the original relation from corresponding instances in the smaller relations
  – hence must extract PK of main relation
• Determine PK of new relations created
  – extracted repeating group will normally have a composite PK including the main relation's PK
    > but NOT always, PK of main relation may simply act as a FK
  – INSURED (comp_code, comp_name (insured_id, insured_name, ..))
    » COMPANY (comp_code, comp_name)
    » INSURED (insured_id, comp_code, insured_name, ..)
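The lossless-join property can be checked on sample data by decomposing a relation and joining the pieces back together. This is a sketch with hypothetical data and helper names, using the COMPANY / INSURED example; it also shows what goes wrong if the main relation's PK is not carried into the extracted relation.

```python
def project(rows, attrs):
    """Relational projection: keep `attrs` and drop duplicate rows."""
    seen, out = set(), []
    for r in rows:
        t = tuple(r[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(zip(attrs, t)))
    return out

def natural_join(left, right):
    """Combine rows that agree on every attribute name the two relations share."""
    if not left or not right:
        return []
    shared = [a for a in left[0] if a in right[0]]
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in shared)]

def same_relation(a, b):
    """Compare two relations as sets of rows, ignoring row and column order."""
    as_set = lambda rows: {tuple(sorted(r.items())) for r in rows}
    return as_set(a) == as_set(b)

# Hypothetical flattened data
flat = [
    {"comp_code": "C1", "comp_name": "Acme", "insured_id": 1, "insured_name": "Ann"},
    {"comp_code": "C1", "comp_name": "Acme", "insured_id": 2, "insured_name": "Bob"},
    {"comp_code": "C2", "comp_name": "Zenith", "insured_id": 3, "insured_name": "Cy"},
]

company = project(flat, ["comp_code", "comp_name"])
insured = project(flat, ["insured_id", "comp_code", "insured_name"])
print(same_relation(natural_join(insured, company), flat))  # lossless

# Without comp_code there is nothing to join on, so every insured person
# is paired with every company and spurious rows appear.
insured_no_fk = project(flat, ["insured_id", "insured_name"])
print(len(natural_join(insured_no_fk, company)), "rows instead of", len(flat))
```

Here comp_code acts only as a FK in INSURED, matching the "NOT always" case on the slide where insured_id alone is the PK.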
10
First Normal Form continued
• An alternative way (model 2) of looking at this scenario
  – Present data in tabular format, where each cell has a single value and there are no repeating groups
  – Eliminate repeating groups, eliminate nulls by making sure that each repeating group attribute contains an appropriate data value
11
Model 2: Dependency Diagram (1NF)
12
1NF to 2NF
• A RELATION IS IN 2NF IF -
  – it is in 1NF and all non key attributes are functionally dependent on the entire key
    > i.e. no partial dependencies exist
• Model 1: move from 1NF to 2NF by removing partial dependencies
  – 1NF: PROJECT (proj_num, proj_name)
  – 1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours)
• 1NF: PROJECT (proj_num, proj_name)
  – already in 2NF: only one attribute in PK, thus there CANNOT be any partial dependencies
    > 2NF: PROJECT (proj_num, proj_name)
• 1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours)
  – becomes
    > 2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour)
    > 2NF: ASSIGN (proj_num, emp_num, assign_hours)
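The split of ASSIGN into EMPLOYEE and ASSIGN amounts to two projections. The sketch below uses hypothetical rows and a hypothetical `project` helper to show that the employee details, which were repeated once per assignment, are stored once per employee after the split.

```python
def project(rows, attrs):
    """Relational projection: keep `attrs` and drop duplicate rows."""
    unique = dict.fromkeys(tuple((a, r[a]) for a in attrs) for r in rows)
    return [dict(t) for t in unique]

# Hypothetical 1NF ASSIGN rows
assign_1nf = [
    {"proj_num": 15, "emp_num": 101, "emp_name": "Ann",
     "job_class": "Programmer", "chg_hour": 35.75, "assign_hours": 12.5},
    {"proj_num": 15, "emp_num": 102, "emp_name": "Bob",
     "job_class": "Systems Analyst", "chg_hour": 96.75, "assign_hours": 8.0},
    {"proj_num": 18, "emp_num": 101, "emp_name": "Ann",
     "job_class": "Programmer", "chg_hour": 35.75, "assign_hours": 4.0},
    {"proj_num": 18, "emp_num": 103, "emp_name": "Cy",
     "job_class": "Programmer", "chg_hour": 35.75, "assign_hours": 6.0},
]

# Attributes that depend on emp_num alone move to EMPLOYEE
employee_2nf = project(assign_1nf, ["emp_num", "emp_name", "job_class", "chg_hour"])
# Only assign_hours depends on the whole key (proj_num, emp_num)
assign_2nf = project(assign_1nf, ["proj_num", "emp_num", "assign_hours"])

print(len(assign_1nf), "assignment rows,", len(employee_2nf), "employee rows")
```

Employee 101 appears in two assignment rows but only once in `employee_2nf`.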
13
2NF Conversion Results (Model 1 & 2)
Note Model 1 & 2 now equivalent
14
2NF to 3NF
• A RELATION IS IN 3NF IF -
  – it is in 2NF and all transitive dependencies have been removed - check for a non key attribute dependent on another non key attribute
• Move from 2NF to 3NF by removing transitive dependencies
  – 2NF: PROJECT (proj_num, proj_name)
  – 2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour)
  – 2NF: ASSIGN (proj_num, emp_num, assign_hours)
• PROJECT and ASSIGN already in 3NF
  – 3NF: PROJECT (proj_num, proj_name)
  – 3NF: ASSIGN (proj_num, emp_num, assign_hours)
• 2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour)
  – 3NF: EMPLOYEE (emp_num, emp_name, job_class)
  – 3NF: JOB (job_class, chg_hour)
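Removing the transitive dependency job_class → chg_hour can be sketched the same way. The rows and the `chg_hour_for` helper are hypothetical; the point is that an hourly rate change becomes a single update once JOB is its own relation.

```python
# Hypothetical 2NF EMPLOYEE rows
employee_2nf = [
    {"emp_num": 101, "emp_name": "Ann", "job_class": "Programmer", "chg_hour": 35.75},
    {"emp_num": 102, "emp_name": "Bob", "job_class": "Systems Analyst", "chg_hour": 96.75},
    {"emp_num": 103, "emp_name": "Cy", "job_class": "Programmer", "chg_hour": 35.75},
]

# 3NF EMPLOYEE (emp_num, emp_name, job_class)
employee_3nf = [
    {k: e[k] for k in ("emp_num", "emp_name", "job_class")} for e in employee_2nf
]
# 3NF JOB (job_class, chg_hour), keyed on job_class
job_3nf = {e["job_class"]: e["chg_hour"] for e in employee_2nf}

def chg_hour_for(emp_num):
    """Look up an employee's rate through their job class."""
    employee = next(e for e in employee_3nf if e["emp_num"] == emp_num)
    return job_3nf[employee["job_class"]]

# One update to JOB changes the rate for every Programmer
job_3nf["Programmer"] = 37.00
print(chg_hour_for(101), chg_hour_for(103))
```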
15
3NF Conversion Results
16
Improving the Design
• To improve the design of the database the following changes could be made:
  – PK assignment
  – Naming conventions
  – Attribute atomicity
  – Adding attributes
  – Adding relationships
  – Refining PKs
  – Maintaining historical accuracy
  – Using derived attributes
17
Improving the Design continued
• Returning to the table in Figure 5.1 (slide 4)
  – Data loss – who is the project leader?
    > modify project (R&C approach)
      – 3NF: PROJECT (proj_num, proj_name, emp_num)
    > Alternative, add emp_num at UNF
    > Do not use synonyms when naming attributes – always use the same name for the same attribute, e.g. do not rename emp_num in PROJECT to leader_num
  – JOB (job_class, chg_hour)
    > Job_class is a string, e.g. Systems Analyst
      – Redundant data with associated issues, poor PK
      – Better to create job code
    > modify job (R&C approach)
      – 3NF: JOB (job_code, job_description, job_chg_hour)
    > Alternative, make changes at UNF
18
Completed Database
19
Completed Database continued
20
Entire Process UNF to 3NF
• UNF
  – PROJECT (proj_num, proj_name, emp_num (emp_num, emp_name, job_code, job_description, job_chg_hour, assign_hours))
• 1NF – remove repeating group and identify PK
  – PROJECT (proj_num, proj_name, emp_num)
  – ASSIGN (proj_num, emp_num, emp_name, job_code, job_description, job_chg_hour, assign_hours)
• 2NF – remove partial dependencies
  – PROJECT (proj_num, proj_name, emp_num)
  – EMPLOYEE (emp_num, emp_name, job_code, job_description, job_chg_hour)
  – ASSIGN (proj_num, emp_num, assign_hours)
• 3NF – remove transitive dependencies
  – PROJECT (proj_num, proj_name, emp_num)
  – EMPLOYEE (emp_num, emp_name, job_code)
  – ASSIGN (proj_num, emp_num, assign_hours)
  – JOB (job_code, job_description, job_chg_hour)
• Note R&C show some further 'suggested' improvements
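The four 3NF relations can be tried out as tables using Python's built-in sqlite3 module. This is a sketch under stated assumptions: the column types, NOT NULL choices, and sample values are not given on the slides and are invented here, while the table and column names follow the 3NF relations above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE job (
    job_code        TEXT PRIMARY KEY,
    job_description TEXT NOT NULL,
    job_chg_hour    REAL NOT NULL
);
CREATE TABLE employee (
    emp_num  INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    job_code TEXT NOT NULL REFERENCES job(job_code)
);
CREATE TABLE project (
    proj_num  INTEGER PRIMARY KEY,
    proj_name TEXT NOT NULL,
    emp_num   INTEGER REFERENCES employee(emp_num)  -- project leader
);
CREATE TABLE assign (
    proj_num     INTEGER NOT NULL REFERENCES project(proj_num),
    emp_num      INTEGER NOT NULL REFERENCES employee(emp_num),
    assign_hours REAL NOT NULL,
    PRIMARY KEY (proj_num, emp_num)
);
""")

# Hypothetical sample rows
conn.execute("INSERT INTO job VALUES ('PRG', 'Programmer', 35.75)")
conn.execute("INSERT INTO employee VALUES (101, 'Ann', 'PRG')")
conn.execute("INSERT INTO project VALUES (15, 'Alpha', 101)")
conn.execute("INSERT INTO assign VALUES (15, 101, 12.5)")

# Joining the four tables rebuilds one line of the original report
row = conn.execute("""
    SELECT p.proj_name, e.emp_name, j.job_description,
           j.job_chg_hour, a.assign_hours
    FROM assign a
    JOIN project  p ON p.proj_num = a.proj_num
    JOIN employee e ON e.emp_num  = a.emp_num
    JOIN job      j ON j.job_code = e.job_code
""").fetchone()
print(row)
```

With this structure a job and its hourly rate can be recorded before any employee holds it, and an employee can be recorded before being assigned to a project, which addresses the insertion anomaly described earlier.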
21
Normalisation presented as a Conceptual ERD
22
Normalisation presented as a Logical ERD
23
Normalisation and Database Design
• Normalisation should be part of design process
• Make sure that proposed entities meet required normal form before table structures are created
• ER diagram
  – Provides the big picture, or macro view, of an organization’s data requirements and operations
  – Created through an iterative process
    > Identifying relevant entities, their attributes and their relationships
    > Use results to identify additional entities and attributes
• Normalisation procedures
  – Focus on the characteristics of specific entities
  – A micro view of the entities within the ER diagram
• Difficult to separate normalisation process from ER modeling process
• Two techniques should be used concurrently
24
Normalisation and ER Diagrams
• ER Diagramming
  – Top down approach
  – Fast
  – Examine requirements
  – Business knowledge
• Normalisation
  – Bottom up approach
  – Very slow
  – Examine existing data
  – Mathematically based
• Both together
  – Top down create - bottom up checking
  – Accuracy
  – Greater understanding of the data
25
Summary
• This lecture
  – Understand the purpose of normalisation
  – Understand the problems associated with redundant data
  – Identify various types of update anomalies such as insertion, deletion, and modification anomalies
  – Recognise the appropriateness or quality of the design of relations
  – Identify various types of functional dependencies between attributes
  – Understand how functional dependencies can be used to group attributes into relations that are in a known normal form
  – Identify the most commonly used normal forms, namely 1NF, 2NF and 3NF
  – Perform normalisation
  – Understand various ways to refine 3NF relations to achieve better database design
  – Produce an ER diagram from the derived set of 3NF relations
• Next lecture
  – Structured Query Language (SQL) - DML