Naeem A. Mahoto
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Warehouse and Data Mining
Lecture No. 12
Normalization and De-normalization
Database Design
• Conceptual
  – identify important entities and relationships
  – determine attribute domains and candidate keys
• Logical
  – split data into multiple tables, such that:
    • no information is lost
    • useful information can be easily reconstituted
  – draw the E-R diagram
  – validate the model using normalization
• Physical
  – implement on the DBMS
Database Anomalies
• Database anomalies are unmatched or missing information caused by limitations or flaws within a given database
• They are problems in relations that occur due to redundancy in the relations
• These anomalies affect the processes of inserting, deleting and modifying data in relations/tables
Types of Anomalies
• Insertion Anomaly: occurs when a new record is inserted into the relation
  – The user cannot insert a fact about one entity until he/she also has an additional fact about another entity
• Deletion Anomaly: occurs when a record is deleted from the relation
  – Deleting facts about one entity automatically deletes facts about another entity
• Modification Anomaly: occurs when a record is updated in the relation
  – Modifying the value of a specific attribute requires modifications in all records in which that value occurs
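As a brief illustration (the table and column names below are hypothetical, not from the lecture), a single table that mixes student and course facts exhibits all three anomalies:

  -- Hypothetical un-normalized table mixing two entities
  create table student_course (
    student_id   int,
    student_name varchar(50),
    course_id    varchar(10),
    course_name  varchar(50)
  );

  -- Insertion anomaly: a new course cannot be recorded until
  -- at least one student enrolls in it, because the row would
  -- have no student facts

  -- Deletion anomaly: deleting the last student enrolled in a
  -- course also deletes the fact that the course exists
  delete from student_course where student_id = 7;

  -- Modification anomaly: renaming a course must touch every
  -- row in which that course name occurs
  update student_course set course_name = 'Database Systems'
  where course_id = 'CS101';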
Normalization
• Normalization is the process of converting a bad database design into a form that overcomes database anomalies
• It is the process of organizing the fields and tables of a relational database to minimize redundancy (eliminate redundant data) and dependency (ensure dependencies make sense)
• Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them
• The goal is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships
Normalization
• Edgar F. Codd (inventor of the relational model) introduced normalization in 1970; it has since been developed (by Codd and others) into a series of normal forms:
  – First normal form (1NF)
  – Second normal form (2NF)
  – Third normal form (3NF)
  – Boyce-Codd normal form (BCNF)
  – Fourth normal form (4NF)
  – Fifth normal form (5NF)
  – Domain/key normal form (DKNF)
First Normal Form (1NF)
• A relation/table is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain
• Example: consider a table that stores Customers and their Telephone Numbers, where a customer may have more than one telephone number
First Normal Form (1NF)
• Tables redesigned to satisfy 1NF are sketched below
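A minimal SQL sketch of one common 1NF design, with assumed table and column names (the slide's original figure is not reproduced here): the repeating telephone numbers move into a separate table, one row per number:

  -- Not in 1NF: a phone_numbers column holding a list of values
  -- customer(customer_id, name, phone_numbers)

  -- 1NF design: atomic values only, one telephone number per row
  create table customer (
    customer_id int primary key,
    name        varchar(50)
  );

  create table customer_phone (
    customer_id  int references customer(customer_id),
    phone_number varchar(20),
    primary key (customer_id, phone_number)
  );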
Second Normal Form (2NF)
• A relation/table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of every candidate key
• Equivalently: a table/relation is in 2NF if it is in first normal form and every non-primary-key column is fully functionally dependent on the primary key
• Full functional dependency: if A and B are (sets of) columns of a table, B is fully functionally dependent on A when B depends on the whole of A and not on any proper subset of A
Second Normal Form (2NF)
• Consider a table describing employees' skills, with columns Employee, Skill and Current Work Location:
  – The candidate key is the composite {Employee, Skill}
  – An employee may appear more than once (he/she might have multiple skills)
  – Current Work Location is dependent on only part of the candidate key (Employee)
  – Therefore the table is not in 2NF
• A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}
Progressing to 2NF
• If a table is not in second normal form:
  – Move each data item that depends on only part of the primary key, along with the part of the key on which it is functionally dependent, to a new table
  – Add any other data items that are functionally dependent on the same part of the key
  – Make the partial primary key the primary key of the new table
Second Normal Form (2NF)
• The two resulting tables are sketched below
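A hedged SQL sketch of this decomposition (column types are assumptions):

  -- Not in 2NF: current_work_location depends only on employee,
  -- i.e. on part of the candidate key {employee, skill}
  -- employee_skills(employee, skill, current_work_location)

  -- 2NF design
  create table employees (
    employee              varchar(50) primary key,
    current_work_location varchar(50)
  );

  create table employees_skills (
    employee varchar(50) references employees(employee),
    skill    varchar(50),
    primary key (employee, skill)
  );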
Third Normal Form (3NF)
• A table is in 3NF if and only if both of the following conditions hold:
  – The relation R (table) is in second normal form (2NF)
  – Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R
• Informally: a table is in 3NF if it is in 1NF and 2NF and no non-primary-key column is transitively dependent on the primary key
Third Normal Form (3NF)
• Example: consider a table with columns A, B, and C. If B is functionally dependent on A and C is functionally dependent on B, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C)
Third Normal Form (3NF)
• A 2NF table that fails to meet the requirements of 3NF: a tournament winners table with composite candidate key {Tournament, Year} and non-prime attributes Winner and Winner Date of Birth
• Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner
Progressing to 3NF
• Move all items involved in transitive dependencies to a new entity
• Identify a primary key for the new entity
• Place the primary key of the new entity as a foreign key on the original entity
Third Normal Form (3NF)
• The resulting 3NF tables are sketched below
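A hedged SQL sketch of the 3NF decomposition (column types are assumptions): the transitively dependent date of birth moves into its own table keyed on the winner:

  -- Not in 3NF: winner_date_of_birth depends on {tournament, year}
  -- only transitively, via winner
  -- tournament_winners(tournament, year, winner, winner_date_of_birth)

  -- 3NF design
  create table winners (
    winner        varchar(50) primary key,
    date_of_birth date
  );

  create table tournament_winners (
    tournament varchar(50),
    year       int,
    winner     varchar(50) references winners(winner),
    primary key (tournament, year)
  );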
Boyce-Codd Normal Form (BCNF)
• A slightly stronger version of the third normal form (3NF)
• A relational schema R is in Boyce–Codd normal form if and only if, for every one of its dependencies X → Y, at least one of the following conditions holds:
  – X → Y is a trivial functional dependency (Y ⊆ X)
  – X is a superkey for schema R
• Only in rare cases does a 3NF table not meet the requirements of BCNF
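One classic illustration of such a rare case (not from the slides; the table and names are hypothetical): a 3NF table whose determinant is not a superkey:

  -- enrollment(student, course, teacher)
  -- Assume each teacher teaches exactly one course: teacher -> course
  -- Candidate keys: {student, course} and {student, teacher}
  -- teacher -> course is non-trivial and teacher is not a superkey,
  -- so BCNF is violated; 3NF still holds because course is a
  -- prime attribute (it belongs to a candidate key)

  -- BCNF decomposition: make the determinant a key
  create table teacher_course (
    teacher varchar(50) primary key,
    course  varchar(50)
  );

  create table student_teacher (
    student varchar(50),
    teacher varchar(50) references teacher_course(teacher),
    primary key (student, teacher)
  );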
Fourth Normal Form (4NF)
• A table is in fourth normal form (4NF) if it is in 3NF and there are no multi-valued dependencies
• Multi-valued dependency: in a table with columns A, B, and C, there is a multi-valued dependency of column B on column A if each value of A is associated with a specific collection of values of B and, furthermore, this collection is independent of any values of C
  – E.g. (employee, skill, language): two many-to-many relationships that are independent, because any skill can be paired with any language
• To remove multi-valued dependencies, create separate tables for the independent repeating groups, as sketched below
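A hedged SQL sketch of that decomposition (column types are assumptions): each independent many-to-many relationship gets its own table:

  -- Not in 4NF: employee ->> skill and employee ->> language are
  -- independent multi-valued dependencies in one table
  -- employee_skill_language(employee, skill, language)

  -- 4NF design: one table per independent repeating group
  create table employee_skills (
    employee varchar(50),
    skill    varchar(50),
    primary key (employee, skill)
  );

  create table employee_languages (
    employee varchar(50),
    language varchar(50),
    primary key (employee, language)
  );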
De-normalization
• De-normalization is the process of combining tables in a careful manner to improve performance
• It is the process of deliberately breaking the rules of 3NF
• The primary reasons for doing this are:
  – To reduce the number of joins that must be processed in queries, thereby improving database performance
  – To map the physical database structure more closely to the user's dimensional business model, structuring tables along the lines of how users will ask questions
De-normalization
• Normalization is a rule of thumb in DBMS design, but in a Decision Support System (DSS) ease of use is achieved by way of de-normalization
• It brings dispersed but related data items "close" together
• Query performance in a DSS depends significantly on the physical data model
• De-normalization specifically improves performance by:
  – Reducing the number of tables, and hence the reliance on joins, which consequently speeds up performance
  – Reducing the number of joins required during query execution, or
  – Reducing the number of rows (records) to be retrieved from the primary data table
Normalization vs. De-normalization
De-normalization
• “Depending on whether the modeler is building the model for a data mart or a data warehouse, the data modeler will wish to engage in some degree of de-normalization.” [Bill Inmon]
• “De-normalization of the logical data model serves the purpose of making the data more efficient to access. In the case of a data mart, a high degree of de-normalization can be practiced. In the case of a data warehouse, a low degree of de-normalization is in order.” [Bill Inmon]
Issues to consider in De-normalization
• The effects of de-normalization on database performance are unpredictable: many applications can be affected negatively by de-normalization
• De-normalize the implementation of the logical model only after the costs and benefits have been thoroughly analyzed, and only after a normalized logical design has been completed
De-normalization: Effects
• Consider the following effects of de-normalization before deciding to undertake design changes:
  – A de-normalized physical implementation can increase hardware costs
  – While de-normalization benefits the applications it is specifically designed to enhance, it often decreases the performance of other applications
  – De-normalization introduces update anomalies into the database
De-normalization
• The following techniques are typical of the de-normalizations that can sometimes be exploited to optimize performance:
  – Pre-join
  – Column replication or movement
  – Pre-aggregation
Pre-join: De-normalization
• A pre-join de-normalization moves frequently joined attributes into the same base relation in order to eliminate join processing
• It avoids the performance impact of frequent joins
• It typically increases storage requirements
Pre-join: De-normalization
• Before de-normalization, two tables in a 1:m relationship:
  – sales (sale_id, store_id, sale_dt, …)
  – sales_detail (tx_id, sale_id, item_id, …, item_qty, sale_amt)

  select sum(sales_detail.sale_amt)
  from sales, sales_detail
  where sales.sale_id = sales_detail.sale_id
    and sales.sale_dt between '2006-11-26' and '2006-12-25';
Pre-join: De-normalization
• After de-normalization, a single pre-joined table:
  – d_sales_detail (tx_id, sale_id, store_id, sale_dt, item_id, …, item_qty, sale_amt)

  select sum(d_sales_detail.sale_amt)
  from d_sales_detail
  where d_sales_detail.sale_dt between '2006-11-26' and '2006-12-25';
Column Replication: De-normalization
• Take columns that are frequently accessed via large-scale joins and replicate (or move) them into the detail table(s) to avoid the join operation
• It avoids the performance impact of frequent joins
• It increases the storage requirements of the database
Column Replication: De-normalization
• A three-table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior
• Before de-normalization, three tables in 1:m relationships:
  – Customer (Customer_Id, Customer_Nm, Address, SIC, …)
  – Account (Account_Id, Customer_Id, Balance$, Open_Dt, …)
  – Tx (Tx_Id, Account_Id, Tx$, Tx_Dt, Location_Id, …)
• After de-normalization, Customer_Id is replicated into the transaction table (see the sketch below):
  – Tx (Tx_Id, Account_Id, Customer_Id, Tx$, Tx_Dt, Location_Id, …)
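A hedged SQL sketch of the effect on a typical query, using the table names from the slide (the query itself is illustrative, and the transaction-amount column shown as Tx$ on the slide is written here as Tx_Amt, an assumption):

  -- Before: three-table join to total transactions per customer
  select c.Customer_Id, sum(t.Tx_Amt)
  from Customer c, Account a, Tx t
  where c.Customer_Id = a.Customer_Id
    and a.Account_Id = t.Account_Id
  group by c.Customer_Id;

  -- After replicating Customer_Id into Tx: no join needed at all
  select Customer_Id, sum(Tx_Amt)
  from Tx
  group by Customer_Id;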
Pre-aggregation: De-normalization
• Take aggregate values that are frequently used in decision-making and pre-compute them into physical tables in the database
• It can provide a huge performance advantage by avoiding frequent aggregation of detailed data
• Pre-aggregation adds a significant maintenance burden to the data warehouse
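A hedged sketch of pre-aggregation, reusing the d_sales_detail table from the pre-join example (create table ... as select syntax varies slightly by DBMS):

  -- Pre-compute a daily sales summary once (e.g. in the nightly load)
  create table d_sales_daily as
  select sale_dt, sum(sale_amt) as total_sale_amt
  from d_sales_detail
  group by sale_dt;

  -- Decision-support queries read the small summary table instead
  -- of re-aggregating the detail rows every time
  select sum(total_sale_amt)
  from d_sales_daily
  where sale_dt between '2006-11-26' and '2006-12-25';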