TRANSCRIPT
Data Preprocessing
Relational Databases - Normalization Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with the 3VL approach
Special Values
Remember: Relational Databases
Model entities and relationships
Entities are the things in the real world
Example: information about employees and the department they work for; employee and department are entities
Relationships are the links between these entities: an employee works for a department
Relations or Tables
Relation: a table of data
• table = relation (set theory, based on predicate logic)
employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128
Columns and Rows
Each column or attribute describes some piece of data that each record in the table has
Each row in a table represents a record
Rows, records or tuples
Keys
A superkey is a column (or a set of columns) that can be used to identify a row in a table
A key is a minimal superkey
There may be several possible keys: candidate keys
We choose the primary key from among the candidate keys
The primary key is used to identify a single row (record)
Foreign keys represent links between tables
Keys
primary key: employeeID, foreign key: departmentID

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128
Functional Dependencies
If there is a functional dependency between columns A and B in a given table, which may be written

A → B

then the value of column A determines the value of column B
Example: employeeID functionally determines the name
Schema
Database schema
Structure or design of the database Database without any data in it
employee(employeeID,name,job,departmentID)
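A minimal sketch of this schema in SQLite (via Python's sqlite3). The column types are assumptions, since the slide only gives the relation heading; the sample rows are taken from the table above.

```python
import sqlite3

# Create the employee relation from the slide. Types are assumed:
# the slide only names the attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        job          TEXT,
        departmentID INTEGER
    )
""")
rows = [
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
    (6651, "Alex", "Programmer", 128),
    (9006, "Claudia", "System-Administrator", 128),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# employeeID is the primary key: it uniquely identifies each row.
ids = [r[0] for r in conn.execute("SELECT employeeID FROM employee")]
assert len(ids) == len(set(ids))
```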
Design
Minimize redundancy
Redundancy: data is repeated in different rows
employee(employeeID,name,job,departmentID,departmentName)
employeeID  name     job                   departmentID  departmentName
7513        Nora     Programmer            128           Research and Development
9842        Ben      DBA                   42            Finance
6651        Alex     Programmer            128           Research and Development
9006        Claudia  System-Administrator  128           Research and Development
Reduce redundancy
employee(employeeID,name,job,departmentID,departmentName)
becomes
employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)
Insert Anomalies
Insert data into a flawed table
The inserted data does not match what is already in the table
It is not obvious which of the rows in the database is correct
Deletion Anomalies
Delete data from a flawed schema
When we delete all the employees of Department 128, we no longer have any record that Department 128 exists
Update Anomalies
Change data in a flawed schema
We may fail to change the data correctly in every row
Null Values
Avoid schema designs that have large numbers of empty attributes
Normalization
Remove design flaws from a database
Normal forms are a set of rules describing what we should and should not do in our table structures
Breaking tables into smaller tables that form a better design
Normal Forms
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
First Normal Form (1NF)
Each attribute or column value must be atomic
Each attribute must contain a single value
employeeID  name     job                   departmentID  skills
7513        Nora     Programmer            128           C, Perl, Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB, Java
9006        Claudia  System-Administrator  128           NT, Linux
1NF
employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
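The 1NF transformation above can be sketched in Python: the comma-separated skills column is split so that every attribute value becomes atomic, one skill per row. The data is taken from the slides.

```python
# Non-1NF input: the skills column holds a comma-separated list.
non_1nf = [
    (7513, "Nora", "Programmer", 128, "C, Perl, Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB, Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT, Linux"),
]

# 1NF output: one row per (employee, skill) pair, each value atomic.
in_1nf = [
    (eid, name, job, dept, skill.strip())
    for (eid, name, job, dept, skills) in non_1nf
    for skill in skills.split(",")
]
```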
Second Normal Form (2NF)
All attributes that are not part of the primary key must be fully dependent on the primary key
Each non-key attribute must be functionally dependent on the whole key
The table must already be in 1NF
2NF ?
employee(employeeID,name,job,departmentID,skill)
employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
Functional dependencies
employeeID, skill → name, job, departmentID
employeeID → name, job, departmentID
name, job and departmentID are only partially functionally dependent on the primary key (employeeID, skill)
They are not fully functionally dependent on the primary key
2NF
Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key
Break the table into two tables:
employee(employeeID,name,job,departmentID)
employeeSkills(employeeID,skill)
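A sketch of this 2NF decomposition in Python: project the 1NF rows onto the two smaller relations, which removes the partial dependency. The data is taken from the slides.

```python
# The 1NF table from the slides.
rows = [
    (7513, "Nora", "Programmer", 128, "C"),
    (7513, "Nora", "Programmer", 128, "Perl"),
    (7513, "Nora", "Programmer", 128, "Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB"),
    (6651, "Alex", "Programmer", 128, "Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT"),
    (9006, "Claudia", "System-Administrator", 128, "Linux"),
]

# Projection onto employee(employeeID, name, job, departmentID):
# sets remove the duplicate rows that the redundancy produced.
employee = sorted({(eid, name, job, dept) for (eid, name, job, dept, _) in rows})

# Projection onto employeeSkills(employeeID, skill).
employee_skills = sorted({(eid, skill) for (eid, _, _, _, skill) in rows})
```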
Third Normal Form (3NF)
Remove all transitive dependencies
The table must be in 2NF
employee(employeeID,name,job,departmentID,departmentName)
employeeID → name, job, departmentID, departmentName
departmentID → departmentName
employeeID  name   job         departmentID  departmentName
7513        Nora   Programmer  128           Research
9842        Ben    DBA         42            Finance
6651        Ajay   Programmer  128           Research
9006        Candy  SYS         128           Research
Transitive dependency
employeeID → departmentName, because
employeeID → departmentID
departmentID → departmentName
3NF
Remove the transitive dependency
Decompose into multiple tables:
employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)
3NF
A table is in 3NF if, for every functional dependency, either:
The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)
or
The right side of the functional dependency is part of some key of the table
BCNF
Boyce-Codd Normal Form drops the second alternative: all attributes must be functionally determined by a superkey
Full normalization means lots of logically separate relations
Lots of logically separate relations means a lot of physically separate files
Lots of physically separate files means a lot of I/O
Difficulties in finding dimensions for dimensional schema, star schema (dimension tables, fact table)
What is Denormalization?
Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn
This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization
Denormalizing the relational variables means replacing them by their join R
This increases redundancy, by ensuring that R is at a lower level of normalization than R1, R2, ..., Rn
Problems
Once we start to denormalize, it is not clear when to stop
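A sketch in Python of the central claim: the normalized projections join back to the original (denormalized) relation R. The data follows the employee/department example from the slides.

```python
# The two projections produced by normalization.
employee = {
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
    (6651, "Alex", "Programmer", 128),
    (9006, "Claudia", "System-Administrator", 128),
}
department = {
    (128, "Research and Development"),
    (42, "Finance"),
}

# Natural join on departmentID. Denormalization replaces the two
# projections by this single relation, which repeats the department
# name on every matching employee row.
joined = {
    (eid, name, job, dept, dname)
    for (eid, name, job, dept) in employee
    for (did, dname) in department
    if did == dept
}
```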
Dimensional Schema
Array cells are often empty
The more dimensions, the more empty cells
Empty cell = missing information
How do we treat information that is not present? How does the system support:
• information that is unknown
• information that has not been captured
• information that is not applicable
• ...
Solution?
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=" "
noisy: containing errors or outliers
• e.g., Salary="-10"
inconsistent: containing discrepancies in codes or names
• e.g., Age="42", Birthday="03/07/1997"
• e.g., rating was "1, 2, 3", now rating is "A, B, C"
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results! Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume that produces the same or similar analytical results
Data discretization
Part of data reduction, but of particular importance, especially for numerical data
Forms of Data Preprocessing
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistency with other recorded data, leading to deletion
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred
Missing Values
The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
null corresponds to UNK, for unknown
Is 3VL a mistake?
Boolean Operators
A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value
AND | t  u  f
 t  | t  u  f
 u  | u  u  f
 f  | f  f  f

OR  | t  u  f
 t  | t  t  t
 u  | t  u  u
 f  | t  u  f

NOT
 t  | f
 u  | u
 f  | t
MAYBE
Another important Boolean operator is MAYBE
MAYBE
 t  | f
 u  | t
 f  | f
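The four truth tables can be modelled in Python, taking None for the unknown truth value. This is only a sketch of the logic, not SQL's actual implementation.

```python
# 3VL connectives, with None standing in for the unknown truth value u.
def and3(p, q):
    if p is False or q is False:   # false dominates AND
        return False
    if p is None or q is None:     # otherwise unknown propagates
        return None
    return True

def or3(p, q):
    if p is True or q is True:     # true dominates OR
        return True
    if p is None or q is None:     # otherwise unknown propagates
        return None
    return False

def not3(p):
    return None if p is None else not p

def maybe(p):
    # MAYBE is true exactly when its operand is unknown.
    return p is None

# p OR NOT(p) is not necessarily true in 3VL:
assert or3(None, not3(None)) is None
```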
Example
Consider the query "Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with a salary less than €40,000":

EMP WHERE MAYBE ( JOB = 'PROGRAMMER' AND
                  DOB < DATE ('1971-1-18') AND
                  SALARY < 40000 )
Without MAYBE, we could instead assume the existence of another operator called IS_UKN, which takes a single scalar operand and returns true if the operand evaluates to UNK and false otherwise:

EMP WHERE ( JOB = 'PROGRAMMER' OR IS_UKN (JOB) )
      AND ( DOB < DATE ('1971-1-18') OR IS_UKN (DOB) )
      AND ( SALARY < 40000 OR IS_UKN (SALARY) )
      AND NOT ( JOB = 'PROGRAMMER' AND DOB < DATE ('1971-1-18') AND SALARY < 40000 )
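A sketch in Python showing that the MAYBE form and the IS_UKN expansion select the same rows, with None standing in for UNK. The sample EMP rows are assumptions for illustration, not from the slides.

```python
import datetime

# Hypothetical EMP rows; None means UNK.
EMP = [
    {"name": "A", "job": "PROGRAMMER", "dob": datetime.date(1965, 5, 2), "salary": None},
    {"name": "B", "job": "PROGRAMMER", "dob": datetime.date(1965, 5, 2), "salary": 30000},
    {"name": "C", "job": "DBA", "dob": None, "salary": 30000},
]
CUTOFF = datetime.date(1971, 1, 18)

def eq3(a, b):   # scalar comparison yielding a 3VL truth value
    return None if a is None or b is None else a == b

def lt3(a, b):
    return None if a is None or b is None else a < b

def and3(p, q):
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def cond(e):     # JOB = 'PROGRAMMER' AND DOB < cutoff AND SALARY < 40000
    return and3(and3(eq3(e["job"], "PROGRAMMER"), lt3(e["dob"], CUTOFF)),
                lt3(e["salary"], 40000))

# MAYBE form: rows where the whole condition evaluates to unknown.
maybe_rows = [e["name"] for e in EMP if cond(e) is None]

# Expanded IS_UKN form: every conjunct true-or-unknown,
# but not all three definitely true.
def expanded(e):
    c1 = eq3(e["job"], "PROGRAMMER")
    c2 = lt3(e["dob"], CUTOFF)
    c3 = lt3(e["salary"], 40000)
    weak = all(c is True or c is None for c in (c1, c2, c3))
    return weak and not (c1 is True and c2 is True and c3 is True)

expanded_rows = [e["name"] for e in EMP if expanded(e)]
assert maybe_rows == expanded_rows
```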
Numeric expressions
WEIGHT * 454
If WEIGHT is UNK, then the result is also UNK
Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK
Anomalies
WEIGHT - WEIGHT = UNK (not 0)
WEIGHT / 0 = UNK (not a "zero divide" error)
UNK is not u (unk)
UNK (the value-unknown null) and u (the unknown truth value) are not the same thing
u is a value; UNK is not a value at all!
Suppose X is of type BOOLEAN, which has three values: t (true), f (false), u (unk)
"X is u" means X is known to be the truth value unknown
"X is UNK" means X is not known at all!
Some 3VL Consequences
The comparison x = x does not give true
In 3VL, x is not equal to itself; the comparison happens to be UNK
The Boolean expression p OR NOT(p) does not necessarily give true
unk OR NOT(unk) = unk
Example
Get all suppliers in Porto and take the union with all suppliers not in Porto
We do not get all suppliers!
We need to add: suppliers maybe in Porto
In 2VL, p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL
While the two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
Some 3VL Consequences
The expression r JOIN r does not necessarily give r
A = B and B = C together do not imply A = C
...
Many equivalences that are valid in 2VL break down in 3VL
We will get wrong answers
Special Values
Drop the idea of null, UNK, unk, and 3VL
Use special values instead to represent missing information
Special values are used in the real world
In the real world we might use the special value "?" to denote the hours worked by a certain employee if the actual value is unknown
Special Values General Idea:
Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
The type of the applicable attribute is then not just, say, integers, but integers plus whatever the special value is
The approach is not very elegant, but it avoids the 3VL problems, because it stays in 2VL
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or a decision tree
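A sketch of the automatic fill-in strategies in plain Python, using None for missing values. The sample values and class labels are assumptions for illustration.

```python
# Hypothetical attribute values (None = missing) and class labels.
values = [30, None, 50, None, 40]
labels = ["a", "a", "b", "b", "a"]

# 1. Global constant: replace every missing value with "unknown".
const_filled = [v if v is not None else "unknown" for v in values]

# 2. Attribute mean over all non-missing values.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)          # (30 + 50 + 40) / 3 = 40.0
mean_filled = [v if v is not None else mean for v in values]

# 3. Class-conditional mean: the smarter variant, averaging only
#    within the sample's own class.
def class_mean(cls):
    vs = [v for v, c in zip(values, labels) if c == cls and v is not None]
    return sum(vs) / len(vs)

class_filled = [v if v is not None else class_mean(c)
                for v, c in zip(values, labels)]
```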
Relational Databases - Normalization Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with the 3VL approach
Special Values
Next..
Data Preprocessing
Visual inspection Noise Reduction Data Reduction Data Discretization Data Integration