data preprocessing. relational databases - normalization denormalization data preprocessing missing...

Data Preprocessing

Upload: leanna-wiggington

Post on 01-Apr-2015

Page 1: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems

Data Preprocessing

Relational Databases
• Normalization
• Denormalization

Data Preprocessing
• Missing Data
• Missing values and the 3VL approach
• Problems with the 3VL approach
• Special Values

Remember: Relational Databases

Relational databases model entities and relationships

• Entities are the things in the real world, e.g., information about employees and the department they work for. Employee and department are entities
• Relationships are the links between these entities, e.g., an employee works for a department

Relations or Tables

Relation: a table of data

• table = relation (a term from set theory, based on predicate logic)

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128

Columns and Rows

Each column (attribute) describes some piece of data that each record in the table has

Each row in a table represents a record; rows are also called records or tuples

Keys

A superkey is a column (or a set of columns) that can be used to identify a row in a table

A key is a minimal superkey

There may be several possible keys: the candidate keys

From the candidate keys we choose the primary key

The primary key is used to identify a single row (record)

Foreign keys represent links between tables

Keys

primary key: employeeID
foreign key: departmentID

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128
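The definitions of superkey and key can be tried out on the example table. The following is a minimal sketch, not part of the original slides: the function names `is_superkey` and `candidate_keys` are hypothetical, and the check only looks at the four sample rows. Uniqueness in sample data does not prove that a column set is a key of the schema; it can only rule candidates out.

```python
from itertools import combinations

# Rows of the employee table from the slides.
COLUMNS = ("employeeID", "name", "job", "departmentID")
ROWS = [
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
    (6651, "Alex", "Programmer", 128),
    (9006, "Claudia", "System-Administrator", 128),
]


def is_superkey(columns, rows, candidate):
    """Return True if the values of the candidate columns differ in every row."""
    idx = [columns.index(c) for c in candidate]
    seen = {tuple(row[i] for i in idx) for row in rows}
    return len(seen) == len(rows)


def candidate_keys(columns, rows):
    """Return all minimal superkeys (keys) that hold for this particular data."""
    keys = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of an already found key is a superkey but not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_superkey(columns, rows, combo):
                keys.append(combo)
    return keys


print(candidate_keys(COLUMNS, ROWS))
```

On these four rows both employeeID and name come out as keys, simply because no two sample employees share a name. Choosing employeeID as the primary key is a design decision about the real world, not something the data alone can justify.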

Functional Dependencies

If there is a functional dependency between columns A and B in a given table, which may be written

A → B

then the value of column A determines the value of column B

For example, employeeID functionally determines the name
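Whether a functional dependency is violated by a given set of rows can be checked mechanically. This is a small sketch under the assumption that rows are stored as dictionaries; `holds_fd` is a hypothetical helper name, not something from the slides.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # Remember the first right-hand value seen for each left-hand value.
        if seen.setdefault(left, right) != right:
            return False
    return True


employees = [
    {"employeeID": 7513, "name": "Nora", "job": "Programmer", "departmentID": 128},
    {"employeeID": 9842, "name": "Ben", "job": "DBA", "departmentID": 42},
    {"employeeID": 6651, "name": "Alex", "job": "Programmer", "departmentID": 128},
    {"employeeID": 9006, "name": "Claudia", "job": "System-Administrator", "departmentID": 128},
]

print(holds_fd(employees, ["employeeID"], ["name"]))    # employeeID -> name
print(holds_fd(employees, ["departmentID"], ["name"]))  # departmentID -> name
```

A dependency that holds in the sample rows may still be accidental; the check can refute a dependency, but only knowledge of the domain can confirm one.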

Schema

Database schema

Structure or design of the database

Database without any data in it

employee(employeeID,name,job,departmentID)

Design

Minimize redundancy

Redundancy: data is repeated in different rows

employee(employeeID,name,job,departmentID,departmentName)

employeeID  name     job                   departmentID  departmentName
7513        Nora     Programmer            128           Research and Development
9842        Ben      DBA                   42            Finance
6651        Alex     Programmer            128           Research and Development
9006        Claudia  System-Administrator  128           Research and Development

Reduce redundancy

employee(employeeID,name,job,departmentID,departmentName)

is split into

employee(employeeID,name,job,departmentID)

department(departmentID,departmentName)

Insert Anomalies

Insert data into a flawed table

The new data does not match what is already in the table

It is not obvious which of the rows in the database is correct

Deletion Anomalies

Delete data from a flawed schema

When we delete all the employees of department 128, we no longer have any record that department 128 exists

Update Anomalies

Change data in a flawed schema

We may fail to change the data correctly in every row that repeats it
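The update and deletion anomalies can be reproduced on the redundant employee table. The sketch below is illustrative only: the helper `department_names` and the new name "R&D" are assumptions, not part of the slides.

```python
# Flawed schema: departmentName is repeated in every employee row.
employees = [
    {"employeeID": 7513, "departmentID": 128, "departmentName": "Research and Development"},
    {"employeeID": 9842, "departmentID": 42, "departmentName": "Finance"},
    {"employeeID": 6651, "departmentID": 128, "departmentName": "Research and Development"},
    {"employeeID": 9006, "departmentID": 128, "departmentName": "Research and Development"},
]


def department_names(rows):
    """Map each departmentID to the set of names recorded for it."""
    names = {}
    for row in rows:
        names.setdefault(row["departmentID"], set()).add(row["departmentName"])
    return names


# Update anomaly: department 128 is renamed, but only in one of its three rows.
employees[0]["departmentName"] = "R&D"
conflicts = {d: n for d, n in department_names(employees).items() if len(n) > 1}

# Deletion anomaly: deleting all employees of department 128 removes
# every trace of the department itself.
remaining = [row for row in employees if row["departmentID"] != 128]
known_departments = set(department_names(remaining))

print(conflicts)
print(known_departments)
```

After the partial update the table holds two names for department 128 and nothing says which one is correct; after the deletion only department 42 is known to exist.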

Null Values

Avoid schema designs that have large numbers of empty attributes

Normalization

Remove design flaws from a database

Normal forms are a set of rules describing what we should and should not do in our table structures

Normalization means breaking tables into smaller tables that form a better design

Normal Forms

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)

First Normal Form (1NF)

Each attribute or column value must be atomic

Each attribute must contain a single value

employeeID  name     job                   departmentID  skills
7513        Nora     Programmer            128           C, Perl, Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB, Java
9006        Claudia  System-Administrator  128           NT, Linux

1NF

employeeID  name     job                   departmentID  skills
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
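The step from the table with a multi-valued skills column to the 1NF table can be sketched as follows. The function name `to_1nf` is hypothetical, and the sketch assumes that skills are separated by commas, as in the example.

```python
# Rows that violate 1NF: the last column holds a comma-separated list.
raw = [
    (7513, "Nora", "Programmer", 128, "C, Perl, Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB, Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT, Linux"),
]


def to_1nf(rows):
    """Emit one row per (employee, skill) pair so that every value is atomic."""
    flat = []
    for *employee, skills in rows:
        for skill in skills.split(","):
            flat.append((*employee, skill.strip()))
    return flat


flat_rows = to_1nf(raw)
for row in flat_rows:
    print(row)
```

Every value is now atomic, at the price of repeating name, job, and departmentID once per skill.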

Second Normal Form (2NF)

The table is already in 1NF

All attributes that are not part of the primary key are fully functionally dependent on the primary key

Each non-key attribute must be functionally dependent on the whole key, not just on a part of it

2NF ?

employee(employeeID,name,job,departmentID,skill)

employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux

Functional dependencies

employeeID, skill → name, job, departmentID
employeeID → name, job, departmentID

name, job, and departmentID are only partially functionally dependent on the primary key (employeeID, skill)

They are not fully functionally dependent on the primary key

2NF

Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key

Break the table into two tables:

employee(employeeID,name,job,departmentID)
employeeSkills(employeeID,skill)
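The decomposition into employee and employeeSkills is a pair of projections of the 1NF table. A minimal sketch, assuming rows are plain tuples; `project` is a hypothetical helper that also removes the duplicate rows a projection produces.

```python
COLUMNS = ("employeeID", "name", "job", "departmentID", "skill")
ROWS_1NF = [
    (7513, "Nora", "Programmer", 128, "C"),
    (7513, "Nora", "Programmer", 128, "Perl"),
    (7513, "Nora", "Programmer", 128, "Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB"),
    (6651, "Alex", "Programmer", 128, "Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT"),
    (9006, "Claudia", "System-Administrator", 128, "Linux"),
]


def project(columns, rows, keep):
    """Keep only the named columns and drop duplicate rows, preserving order."""
    idx = [columns.index(c) for c in keep]
    seen, out = set(), []
    for row in rows:
        projected = tuple(row[i] for i in idx)
        if projected not in seen:
            seen.add(projected)
            out.append(projected)
    return out


employee = project(COLUMNS, ROWS_1NF, ("employeeID", "name", "job", "departmentID"))
employee_skills = project(COLUMNS, ROWS_1NF, ("employeeID", "skill"))

print(employee)
print(employee_skills)
```

The employee table shrinks from eight rows to four, one per employee, while employeeSkills keeps one row per (employeeID, skill) pair.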

Third Normal Form (3NF)

Be in 2NF

Remove all transitive dependencies

employee(employeeID,name,job,departmentID,departmentName)

employeeID → name, job, departmentID, departmentName
departmentID → departmentName

employeeID  name   job         departmentID  departmentName
7513        Nora   Programmer  128           Research
9842        Ben    DBA         42            Finance
6651        Ajay   Programmer  128           Research
9006        Candy  SYS         128           Research

Transitive dependency

employeeID → departmentName

holds only transitively, through

employeeID → departmentID
departmentID → departmentName

3NF

Remove the transitive dependency

Decompose into multiple tables

employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)

3NF

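For every functional dependency in the table, one of the following holds:

The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal); this condition on its own defines Boyce-Codd Normal Form

or

The right side of the functional dependency is part of some key of the table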

BCNF

All attributes must be functionally determined by a superkey
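Given a list of functional dependencies, the superkey condition can be tested with the standard attribute-closure algorithm. The sketch below is an illustration, not the method of the slides; the names `closure` and `bcnf_violations` are hypothetical, and the dependencies are the ones stated for the employee example.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Return the nontrivial dependencies whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= set(schema)
    ]


flat_schema = {"employeeID", "name", "job", "departmentID", "departmentName"}
flat_fds = [
    ({"employeeID"}, {"name", "job", "departmentID"}),
    ({"departmentID"}, {"departmentName"}),
]
print(bcnf_violations(flat_schema, flat_fds))

# After the decomposition each table has only a key-based dependency.
employee_schema = {"employeeID", "name", "job", "departmentID"}
department_schema = {"departmentID", "departmentName"}
print(bcnf_violations(employee_schema, flat_fds[:1]))
print(bcnf_violations(department_schema, flat_fds[1:]))
```

In the flat table, departmentID determines departmentName without being a superkey, so that dependency is reported; the two decomposed tables report nothing.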

Full normalization means lots of logically separate relations

Lots of logically separate relations means a lot of physically separate files

Lots of physically separate files means a lot of I/O

Difficulties in finding dimensions for dimensional schema, star schema (dimension tables, fact table)

What is Denormalization?

Normalizing a relational variable R means replacing R by a set of projections R1,R2,..,Rn such that R is equal to the join of R1,R2,..,Rn

• Reduces redundancy; each projection R1,R2,..,Rn is at the highest possible level of normalization

Denormalizing the relational variables means replacing them by their join R

• Increases redundancy, because R is at a lower level of normalization than R1,R2,..,Rn

Problems

• Once we start to denormalize, it is not clear when to stop
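Denormalization as "replace the projections by their join" can be illustrated with the employee and department tables from the normalization example. This is a sketch that assumes rows are dictionaries; `natural_join` is a hypothetical helper using a simple nested loop, not an efficient join.

```python
def natural_join(left, right):
    """Join two lists of dict rows on all the columns they have in common."""
    if not left or not right:
        return []
    common = set(left[0]) & set(right[0])
    joined = []
    for l_row in left:
        for r_row in right:
            if all(l_row[c] == r_row[c] for c in common):
                joined.append({**l_row, **r_row})
    return joined


# The normalized relations R1 and R2.
employee = [
    {"employeeID": 7513, "name": "Nora", "job": "Programmer", "departmentID": 128},
    {"employeeID": 9842, "name": "Ben", "job": "DBA", "departmentID": 42},
    {"employeeID": 6651, "name": "Alex", "job": "Programmer", "departmentID": 128},
    {"employeeID": 9006, "name": "Claudia", "job": "System-Administrator", "departmentID": 128},
]
department = [
    {"departmentID": 128, "departmentName": "Research and Development"},
    {"departmentID": 42, "departmentName": "Finance"},
]

# The denormalized relation R: one wide table, with the name of
# department 128 stored three times again.
denormalized = natural_join(employee, department)
for row in denormalized:
    print(row)
```

The join saves a lookup at query time but brings back the repeated department names, and with them the anomalies that normalization removed.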

Dimensional Schema

Array cells are often empty

The more dimensions, the more empty cells

An empty cell means missing information

How to treat information that is not present? How does the system support the following cases?

• Information is unknown
• Has not been captured
• Not applicable
• ....

Solution?

Why Data Preprocessing?

Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=“ ”
• noisy: containing errors or outliers
  e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
  e.g., Age=“42” Birthday=“03/07/1997”
  e.g., Was rating “1,2,3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems

Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission

Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results!

Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.

A data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Major Tasks in Data Preprocessing

Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
• Integration of multiple databases, data cubes, or files

Data transformation
• Normalization and aggregation

Data reduction
• Obtains a reduced representation in volume but produces the same or similar analytical results

Data discretization
• Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Data Cleaning

Importance
• “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
• “Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration

Missing Data

Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered

Missing data may need to be inferred

Missing Values

The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)

null corresponds to UNK (for unknown)

Is 3VL a mistake?

Boolean Operators

A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value (u)

AND   t  u  f
t     t  u  f
u     u  u  f
f     f  f  f

OR    t  u  f
t     t  t  t
u     t  u  u
f     t  u  f

NOT
t     f
u     u
f     t
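The truth tables can be written down as three small functions. This is a sketch under stated assumptions: Python's True and False stand for t and f, None stands for the unknown truth value u, and a separate sentinel object `UNK` stands for a missing scalar value. All function names are hypothetical.

```python
UNK = object()  # stand-in for the UNK null: a missing scalar value
# Truth values: True (t), False (f), and None (u, the unknown truth value).


def eq3(x, y):
    """Scalar comparison: u if either operand is UNK, else ordinary equality."""
    if x is UNK or y is UNK:
        return None
    return x == y


def not3(p):
    return None if p is None else (not p)


def and3(p, q):
    if p is False or q is False:
        return False  # one false operand decides the result
    if p is None or q is None:
        return None   # no false operand, but at least one unknown
    return True


def or3(p, q):
    if p is True or q is True:
        return True   # one true operand decides the result
    if p is None or q is None:
        return None
    return False


def truth_table(op):
    """Rows and columns in the same t, u, f order as the tables above."""
    values = (True, None, False)
    return [[op(p, q) for q in values] for p in values]


print(truth_table(and3))
print(truth_table(or3))
print([not3(p) for p in (True, None, False)])
```

Keeping `UNK` and None as two different objects is deliberate: one is a missing value, the other is a truth value.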

MAYBE

Another important Boolean operator is MAYBE

MAYBE
t     f
u     t
f     f

Example

Consider the query “Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with salary less than €40.000”

EMP WHERE MAYBE ( JOB = 'PROGRAMMER' AND
                  DOB < DATE ('1971-1-18') AND
                  SALLARY < 40000 )
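The effect of MAYBE on such a query can be simulated in a few lines. This is only a sketch: the EMP rows are invented for illustration, `UNK` is a hypothetical sentinel for a missing value, and None plays the role of the unknown truth value u.

```python
import operator
from datetime import date

UNK = object()  # stand-in for the UNK null (a missing scalar value)
U = None        # stand-in for the unknown truth value u


def compare(op, x, y):
    """Scalar comparison that evaluates to u if either operand is UNK."""
    if x is UNK or y is UNK:
        return U
    return op(x, y)


def and3(*terms):
    """Three-valued AND over any number of terms."""
    if any(t is False for t in terms):
        return False
    if any(t is U for t in terms):
        return U
    return True


def maybe(p):
    """MAYBE(p) is true exactly when p evaluates to u."""
    return p is U


# Hypothetical EMP rows: (name, JOB, DOB, SALLARY); not data from the slides.
EMP = [
    ("Ann", "PROGRAMMER", date(1965, 3, 1), 35000),  # definitely qualifies
    ("Bob", UNK, date(1960, 7, 9), 30000),           # job unknown
    ("Cat", "PROGRAMMER", UNK, UNK),                 # DOB and salary unknown
    ("Dan", "DBA", UNK, 20000),                      # definitely not a programmer
]


def condition(row):
    _, job, dob, salary = row
    return and3(
        compare(operator.eq, job, "PROGRAMMER"),
        compare(operator.lt, dob, date(1971, 1, 18)),
        compare(operator.lt, salary, 40000),
    )


maybe_rows = [row[0] for row in EMP if maybe(condition(row))]
print(maybe_rows)
```

Ann is excluded because she definitely satisfies the condition, and Dan because one comparison is definitely false; only the rows whose condition evaluates to u remain.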

Without MAYBE, we assume the existence of another operator called IS_UNK, which takes a single scalar operand and returns true if the operand evaluates to UNK and false otherwise

EMP WHERE ( JOB = 'PROGRAMMER' OR IS_UNK (JOB) )
      AND ( DOB < DATE ('1971-1-18') OR IS_UNK (DOB) )
      AND ( SALLARY < 40000 OR IS_UNK (SALLARY) )
      AND NOT ( JOB = 'PROGRAMMER' AND
                DOB < DATE ('1971-1-18') AND
                SALLARY < 40000 )

Numeric expression

WEIGHT * 454

If WEIGHT is UNK, then the result is also UNK

Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK

Anomalies

WEIGHT - WEIGHT = UNK (not 0)
WEIGHT / 0 = UNK (not a “zero divide” error)
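The propagation of UNK through arithmetic can be observed in a real SQL engine, where NULL plays the role of UNK. The sketch below uses SQLite through Python's standard sqlite3 module; the choice of SQLite and the table `part` are assumptions made for illustration. Division by zero is left out because its treatment differs between SQL products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (name TEXT, weight REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO part VALUES (?, ?)",
    [("bolt", 2.0), ("nut", None)],  # the weight of 'nut' is missing
)

rows = conn.execute(
    "SELECT name, weight * 454, weight - weight FROM part ORDER BY name"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

For the part with a known weight, weight - weight is 0; for the part with a missing weight both expressions come back as NULL (None in Python), even though weight - weight would be 0 for every possible weight.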

UNK is not u (unk)

UNK (the value-unknown null) and u (unk, the unknown truth value) are not the same thing

u is a value; UNK is not a value at all!

Suppose X is of type BOOLEAN, which has three values: t (true), f (false), u (unk)

X is unk: X is known to be unk
X is UNK: X is not known!

Some 3VL Consequences

The comparison x = x does not necessarily give true
• In 3VL, x is not equal to itself if it happens to be UNK

The Boolean expression p OR NOT(p) does not necessarily give true
• unk OR NOT(unk) = unk

Example

Get all suppliers in Porto and take the union with all suppliers not in Porto

We do not get all suppliers!

We need to add “maybe in Porto”

p OR NOT(p) in 2VL corresponds to p OR NOT(p) OR MAYBE(p) in 3VL

While two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; instead it contains only knowledge about the real world
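The Porto example can be run against an SQL engine. This is a sketch that assumes SQLite (through Python's standard sqlite3 module) and an invented `supplier` table with one supplier whose city is missing. SQL has no MAYBE operator, so `city IS NULL` is used for "maybe in Porto", which is adequate here because a NULL city is the only source of unknown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (name TEXT, city TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO supplier VALUES (?, ?)",
    [("S1", "Porto"), ("S2", "Lisboa"), ("S3", None)],  # city of S3 is missing
)


def names(sql):
    """Run a query and return the supplier names in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))


# Suppliers in Porto, united with suppliers not in Porto.
in_or_not_in = names(
    "SELECT name FROM supplier WHERE city = 'Porto' "
    "UNION SELECT name FROM supplier WHERE city <> 'Porto'"
)

# Adding the third case: maybe in Porto.
with_maybe = names(
    "SELECT name FROM supplier "
    "WHERE city = 'Porto' OR city <> 'Porto' OR city IS NULL"
)

# x = x is not true for a supplier whose city is missing.
self_equal = names("SELECT name FROM supplier WHERE city = city")
conn.close()

print(in_or_not_in, with_maybe, self_equal)
```

Supplier S3 is in neither half of the union and also fails the comparison of its city with itself; it only appears once the third case is added.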

Some 3VL Consequences

The expression r JOIN r does not necessarily give r

A=B and B=C together do not imply A=C

....

Many equivalences that are valid in 2VL break down in 3VL

We will get wrong answers

Special Values

Drop the idea of null, UNK, unk, and 3VL

Use special values instead to represent missing information

Special values are used in the real world

In the real world we might use the special value “?” to denote the hours worked by a certain employee if the actual value is unknown

Special Values

General Idea:

Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used

The special value must be a value of the applicable type: the type of the attribute is not just integers, but integers plus whatever the special value is

The approach is not very elegant, but it avoids the 3VL problems, because it stays in 2VL
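The special-value approach can be sketched with a sentinel object that belongs to the attribute's type. Everything here is illustrative: the class `Missing`, the constant `MISSING`, and the hours data are assumptions, not taken from the slides.

```python
from typing import Union


class Missing:
    """Special value for 'no regular value is available', shown as '?'."""

    def __repr__(self):
        return "?"


MISSING = Missing()
Hours = Union[int, Missing]  # the type is integers plus the special value

# Hypothetical data: hours worked per employee.
hours_worked = {"Nora": 40, "Ben": MISSING, "Alex": 35}


def worked_less_than(hours: Hours, limit: int) -> bool:
    """Ordinary two-valued comparison; the special value never satisfies it."""
    return hours is not MISSING and hours < limit


under_38 = [name for name, h in hours_worked.items() if worked_less_than(h, 38)]
unknown = [name for name, h in hours_worked.items() if h is MISSING]

print(hours_worked)
print(under_38, unknown)
```

Every comparison yields plain true or false, and the special value is equal to itself, so the 2VL identities keep working. The cost is that each operation on the attribute has to decide explicitly what to do with the special value.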

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually: tedious + infeasible?

Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as Bayesian formula or decision tree
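The first three automatic strategies can be sketched in plain Python. The sample data, the attribute (income), and the function names are hypothetical; None marks a missing value, and the class-mean variant assumes that every class has at least one known value.

```python
from statistics import mean

# Hypothetical samples: (class label, income); None marks a missing income.
samples = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]


def fill_constant(rows, constant="unknown"):
    """Replace every missing value by one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]


def fill_mean(rows):
    """Replace every missing value by the mean of all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]


def fill_class_mean(rows):
    """Replace a missing value by the mean of the known values of its class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: mean(values) for c, values in by_class.items()}
    return [(c, means[c] if v is None else v) for c, v in rows]


print(fill_constant(samples))
print(fill_mean(samples))
print(fill_class_mean(samples))
```

The overall mean places both missing incomes at the same middle value, while the class mean keeps the filled-in values close to the other members of each class, which is why the slide calls it smarter.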

Relational Databases
• Normalization
• Denormalization

Data Preprocessing
• Missing Data
• Missing values and the 3VL approach
• Problems with the 3VL approach
• Special Values

Next..

Data Preprocessing

• Visual inspection
• Noise Reduction
• Data Reduction
• Data Discretization
• Data Integration