lecture week 12 - george mason universitycarlotta/teaching/infs-614-f08/readings/... · infs614,....

17
INFS614,. Lecture week 12 1 Schema Refinement & Normalization Theory Lecture Week 12 INFS614,. Lecture week 12 2 Normal Forms The first question: Is any refinement needed? Normal forms: If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help. Role of FDs in detecting redundancy: Consider a relation R with 3 attributes, ABC. No ICs hold: There is no redundancy here. Given A B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

Upload: habao

Post on 04-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

INFS614,. Lecture week 12 1

Schema Refinement & Normalization Theory

Lecture Week 12

INFS614,. Lecture week 12 2

Normal Forms   The first question: Is any refinement needed?   Normal forms:

–  If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.

  Role of FDs in detecting redundancy: –  Consider a relation R with 3 attributes, ABC.

  No ICs hold: There is no redundancy here.   Given A → B: Several tuples could have the same A value,

and if so, they’ll all have the same B value!

INFS614,. Lecture week 12 3

Boyce-Codd Normal Form (BCNF)   Reln R with FDs F is in BCNF if, for each X → A in F+

–  either A ∈X (i.e, A is in X, this FD is called a trivial FD) –  or X is a (super) key for R (i.e., X → R in F+).

  In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints.

  If BCNF: –  No “data” in R can be predicted using FDs alone. Why: –  Because C is a key, we can’t have two different tuples that

agree on the C value

INFS614,. Lecture week 12 4

Decomposition of a Relation Schema   When a relation schema is not in BCNF: decompose.   Suppose that relation R contains attributes A1 ...

An. A decomposition of R consists of replacing R with two or more relations such that: –  Each new relation schema contains a subset of the

attributes of R (and no attributes that do not appear in R), and

–  Every attribute of R appears as an attribute of at least one of the new relations.

  Intuitively, decomposing R means we will store instances of the relation schemas produced by the decomposition, instead of instances of R.

INFS614,. Lecture week 12 5

Decomposition example Original relation

(not stored in DB!)

Decomposition (in the DB)

INFS614,. Lecture week 12 6

Problems with Decompositions   There are three potential problems to consider:

1.  Some queries become more expensive.   e.g., How much did Attishoo earn? (earn = W*H)

2.  Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation!   Fortunately, not in the SNLRWH example.

3.  Checking some dependencies may require joining the instances of the decomposed relations.   Fortunately, not in the SNLRWH example.

  Tradeoff: Must consider these issues vs. redundancy.

INFS614,. Lecture week 12 7

Example of problem 2

INFS614,. Lecture week 12 8

Lossless Join Decompositions   Decomposition of R into R1 and R2 is lossless-

join w.r.t. a set of FDs F if, for every instance r that satisfies F, we have:

  It is always true that

  In general, the other direction does not hold! If it does, the decomposition is lossless-join.

INFS614,. Lecture week 12 9

Example (lossy decomposition)

INFS614,. Lecture week 12 10

Example (lossless join decomposition)

INFS614,. Lecture week 12 11

Lossless Join Decomposition   The decomposition of R into R1 and R2 is

lossless-join wrt F if and only if F+ contains: –  R1 ∩R2 → R1, or –  R1 ∩R2 → R2

  In particular, the decomposition of R into (XY) and (R-Y) is lossless-join if X → Y holds on R –  assume X and Y do not share attributes. –  WHY?

INFS614,. Lecture week 12 12

Decomposition   Definition extended to decomposition into 3

or more relations in a straightforward way. – Repeated lossless-join decompositions of a relation

R provide a lossless-join decomposition of R   E.g.: R decomposed into R1 and R2; R1 further

decomposed into R11 and R12; R11, R12 and R2 is a lossless-join decomposition of R)

  It is essential that all decompositions used to deal with redundancy are lossless! (Avoids Problem (2).)

INFS614,. Lecture week 12 13

Decomposition into BCNF   ConsiderrelationRwithFDsF.IfX → Ain

F+overR,violatesBCNF,i.e.,–  XAareallinR–  AisnotinX–  X → RisnotinF +

  Then:decomposeRintoR-AandXA.   Repeatedapplicationofthisideawillgiveus

acollectionofrelationsthatareinBCNF;losslessjoindecomposition,andguaranteedtoterminate.

INFS614,. Lecture week 12 14

BCNF Decomposition Example   Assume relation schema CSJDPQV

–  key C, JP → C, SD → P, J → S

  To deal with SD → P, decompose into SDP, CSJDQV.

  To deal with J → S, decompose CSJDQV into JS and CJDQV

  A tree representation of the decomposition: CSJDPQV

SDP CSJDQV

JS CJDQV Using SD → P

Using J → S

INFS614,. Lecture week 12 15

BCNF Decomposition

  In general, several dependencies may cause violation of BCNF. The order in which we “deal with” them could lead to very different sets of relations!

INFS614,. Lecture week 12 16

How do we know R is in BCNF?

  If R has only two attributes, then it is in BCNF   If F only uses attributes in R, then:

– R is in BCNF if and only if for each X → Y in F (not F+!), X is a superkey of R, i.e., X → R is in F+ (not F!).

  In general (F may use attributes outside of R! See example earlier for CSJDQV) , – Need to consider all FD X → A in F+ (not F!).

INFS614,. Lecture week 12 17

Testing for BCNF   To check if a non-trivial dependency X→ A causes

a violation of BCNF 1. compute X+ (the attribute closure of X), and 2.  verify that it includes all attributes of R, that is, X is a

superkey of R.

  Simplified test: To check if a relation schema R with a given set of functional dependencies F is in BCNF, it suffices to check only the dependencies in the given set F for violation of BCNF, rather than checking all dependencies in F+. –  It can be shown that if none of the dependencies in F

causes a violation of BCNF, then none of the dependencies in F+ will cause a violation of BCNF either.

INFS614,. Lecture week 12 18

Testing for BCNF   However, using only F is incorrect when

testing a relation in a decomposition of R (We need to consider all FD X → A in F+)

– E.g. Consider R (A, B, C, D), with F = { A →B, B →C}

  Decompose R into R1(A,B) and R2(A,C,D)   Neither of the dependencies in F contains only attributes

from (A,C,D) so we might be mislead in thinking R2 satisfies BCNF -- WHICH IS NOT TRUE.

  In fact: dependency A → C in F+ shows R2 is not in BCNF.

INFS614,. Lecture week 12 19

Dependency Preserving Decomposition

  Consider CSJDPQV, C is key, JP → C and SD → P. –  BCNF decomposition: CSJDQV and SDP –  Problem: Checking JP → C requires a join!

  Dependency preserving decomposition (Intuitive): –  If R is decomposed into X, Y and Z, and we

enforce the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold. (Avoids Problem (3).)

INFS614,. Lecture week 12 20

BCNF and Dependency Preservation

 In general, there may not be a dependency preserving decomposition into BCNF.

INFS614,. Lecture week 12 21

What FD on a decomposition?   Projection of set of FDs F:

If R is decomposed into X, ... the projection of F onto X (denoted FX ) is the set of FDs U → V in F+ (closure of F ) such that U, V are in X.

INFS614,. Lecture week 12 22

Dependency Preserving Decompositions (Contd.)

  Decomposition of R into X and Y is dependency preserving if (FX union FY ) + = F +

–  i.e., if we consider only dependencies in the closure F + that can be checked in X without considering Y, and in Y without considering X, these imply all dependencies in F +.

  Important to consider F +, not F, in this definition: –  ABC, A → B, B → C, C → A, decomposed into AB and BC. –  Is this dependency preserving? Is C → A preserved??

  Dependency preserving does not imply lossless join: –  ABC, A → B, decomposed into AB and BC.

  And vice-versa! (Example?)

INFS614,. Lecture week 12 23

Another example   Assume CSJDPQV is decomposed into

SDP, JS, CJDQV is not dependency preserving w.r.t. the FDs: JP → C, SD → P and J → S.

  However, it is a lossless join decomposition.   In this case, adding JPC to the collection

of relations gives us a dependency preserving decomposition.

  JPC tuples stored only for checking FD!

INFS614,. Lecture week 12 24

Third Normal Form (3NF)

  Reln R with FDs F is in 3NF if, for all X → A in F+

–  A in X (i.e., FD is trivial), or –  X contains a key for R, or –  A is part of some (candidate) key for R.

  Minimality of a (candidate) key is crucial in third condition above!

INFS614,. Lecture week 12 25

Third Normal Form (3NF)

  If R is in BCNF, obviously in 3NF.   If R is in 3NF, some redundancy is possible.

It is a compromise, used when BCNF not achievable (e.g., no “good” decomposition, or performance considerations). –  Lossless-join, dependency-preserving

decomposition of R into a collection of 3NF relations always possible.

INFS614,. Lecture week 12 26

What Does 3NF Achieve?

  If 3NF is violated by X→ A, one of the following holds: –  X is a proper subset of some key K

  We store (X, A) pairs redundantly. E.g.: SBDC, SBD is the only key, S → C. Pairs (S,C) redundant.

–  X is not a proper subset of any key.   There is a chain of FDs K → X → A, which means that we cannot

associate an X value with a K value unless we also associate an A value with an X value.

E.g.: Hourly_Emps SNLRWH, S is the only key, R → W.   But: even if reln is in 3NF, these problems could arise.

–  e.g., Reserves SBDC, S → C, C → S is in 3NF, but for each reservation of sailor S, same (S, C) pair is stored.

  Thus, 3NF is indeed a compromise with respect to BCNF.

INFS614,. Lecture week 12 27

Testing for 3NF   Optimization: Need to check only FDs in F, need

not check all FDs in F+.

  Use attribute closure to check, for each dependency X → A, if X is a superkey.

  If X is not a superkey, we have to verify if A is part of some candidate key of R –  this test is rather expensive, since it involves finding

candidate keys

INFS614,. Lecture week 12 28

Decomposition into 3NF   Obviously, the algorithm for lossless join decomp

into BCNF can be used to obtain a lossless join decomp into 3NF (typically, can stop earlier).

  To ensure dependency preservation, one idea: –  If X → Y is not preserved, add relation XY. –  Problem is that XY may violate 3NF!

  Refinement: Instead of the given set of FDs F, use a minimal cover for F.

INFS614,. Lecture week 12 29

Minimal Cover for a Set of FDs   Minimal cover G for a set of FDs F:

–  Closure of F = closure of G. –  Right hand side of each FD in G is a single attribute. –  If we modify G by deleting an FD or by deleting

attributes from an FD in G, the closure changes.   Intuitively, every FD in G is needed and “as small

as possible’’ in order to get the same closure as F.   e.g., A → B, ABCD → E, EF → GH, ACDF → EG

has the following minimal cover: –  A → B, ACD → E, EF → G and EF → H

INFS614,. Lecture week 12 30

Dependency-Preserving Decomposition into 3NF

  Given R with a set F of FDs that is a minimal cover. Let R1,…,Rn be a lossless-join decomposition of R in 3NF. Let Fi be the projection of F onto Ri. Do: –  Identify the set N of dependencies in F that is not

preserved, i.e. not included in the closure of the union of F1,…Fn;

–  For each FD X → A in N, create a relation schema XA and add it to the decomposition of R.

  The resulting decomposition is a lossless-join and dependency-preserving decomposition of R into 3NF relations.

INFS614,. Lecture week 12 31

Problem

  Given a relation schema R(ABCDE) with FD’s F = {A → B, C→ D, B → AE} Find all the candidate keys for R.

INFS614,. Lecture week 12 32

Problem (contd.) Proceed as follows:

(a) Identify the attributes that are in the relation schema and not in the set of functional dependencies: None.

(b) Identify the attributes that appear only on the left hand-side in F (these attributes will belong to any single key of the relation schema): C.

(c) Identify the attributes that appear only on the right hand-side in F (these attributes are not part of any key): DE.

(d) Combine the set of attributes obtained in (a) and (b) with the attributes that are not obtained in (c). Compute their closure to check whether they are a key.

INFS614,. Lecture week 12 33

Problem (contd.) •  From (a) and (b), we get C;

•  The attributes not obtained in (c) are: A,B,C.

•  We compute the closure of C: C+ = CD. Thus: C is not a key.

•  We compute the closure of CA: CA+ = CABDE. Thus: CA is a key.

•  We compute the closure of CB: CB+ = CBDAE. Thus: CB is a key.

•  Note: We know from (c) that CD and CE are not keys;

•  Then, we consider combinations of three attributes that don’t contain CA or CB (since they are keys), and include C. The only possible combination is CDE, but D and E are not part of any key.

•  Therefore, CA and CB are the only keys in R.

INFS614,. Lecture week 12 34

Summary of Schema Refinement   If a relation is in BCNF, it is free of redundancies that can

be detected using FDs. Thus, trying to ensure that all relations are in BCNF is a good heuristic.

  If a relation is not in BCNF, we can try to decompose it into a collection of BCNF relations. –  Must consider whether all FDs are preserved. If a lossless-join,

dependency preserving decomposition into BCNF is not possible (or unsuitable, given typical queries), should consider decomposition into 3NF.

–  Decompositions should be carried out and/or re-examined while keeping performance requirements in mind.