TRANSCRIPT
Data Preprocessing
Relational Databases - Normalization Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with the 3VL approach
Special Values
Remember: Relational Databases
Model entities and relationships
Entities are the things in the real world
Example: information about employees and the department they work for; employee and department are entities
Relationships are the links between these entities: an employee works for a department
Relations or Tables
Relation: a table of data
• table = relation (set theory, based on predicate logic)
employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128
Columns and Rows
Each column or attribute describes some piece of data that each record in the table has
Each row in a table represents a record
Rows, records or tuples
Keys
A superkey is a column (or a set of columns) that can be used to identify a row in a table
A key is a minimal superkey
There may be several possible keys: candidate keys
We choose the primary key from among the candidate keys
The primary key is used to identify a single row (record)
Foreign keys represent links between tables
Keys
primary key: employeeID, foreign key: departmentID

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  System-Administrator  128
Functional Dependencies
If there is a functional dependency between columns A and B in a given table, which may be written

A → B

then the value of column A determines the value of column B
Example: employeeID functionally determines the name
Schema
Database schema
Structure or design of the database Database without any data in it
employee(employeeID,name,job,departmentID)
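A minimal sketch of this schema in SQLite (via Python's sqlite3). The column types are assumptions, since the slide only gives the relation heading; the sample rows are taken from the table above.

```python
import sqlite3

# Create the employee relation from the slide. Types are assumed:
# the slide only names the attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        job          TEXT,
        departmentID INTEGER
    )
""")
rows = [
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
    (6651, "Alex", "Programmer", 128),
    (9006, "Claudia", "System-Administrator", 128),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# employeeID is the primary key: it uniquely identifies each row.
ids = [r[0] for r in conn.execute("SELECT employeeID FROM employee")]
assert len(ids) == len(set(ids))
```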
Design
Minimize redundancy
Redundancy: data is repeated in different rows
employee(employeeID,name,job,departmentID,departmentName)
employeeID  name     job                   departmentID  departmentName
7513        Nora     Programmer            128           Research and Development
9842        Ben      DBA                   42            Finance
6651        Alex     Programmer            128           Research and Development
9006        Claudia  System-Administrator  128           Research and Development
Reduce redundancy
employee(employeeID,name,job,departmentID,departmentName)
becomes
employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)
Insert Anomalies
Insert data into a flawed table
The inserted data does not match what is already in the table
It is not obvious which of the rows in the database is correct
Deletion Anomalies
Delete data from a flawed schema
When we delete all the employees of Department 128, we no longer have any record that Department 128 exists
Update Anomalies
Change data in a flawed schema
We may fail to change the data correctly in every row
Null Values
Avoid schema designs that have large numbers of empty attributes
Normalization
Remove design flaws from a database
Normal forms are a set of rules describing what we should and should not do in our table structures
Breaking tables into smaller tables that form a better design
Normal Forms
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
First Normal Form (1NF)
Each attribute or column value must be atomic
Each attribute must contain a single value
employeeID  name     job                   departmentID  skills
7513        Nora     Programmer            128           C, Perl, Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB, Java
9006        Claudia  System-Administrator  128           NT, Linux
1NF
employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
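The 1NF transformation above can be sketched in Python: the comma-separated skills column is split so that every attribute value becomes atomic, one skill per row. The data is taken from the slides.

```python
# Non-1NF input: the skills column holds a comma-separated list.
non_1nf = [
    (7513, "Nora", "Programmer", 128, "C, Perl, Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB, Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT, Linux"),
]

# 1NF output: one row per (employee, skill) pair, each value atomic.
in_1nf = [
    (eid, name, job, dept, skill.strip())
    for (eid, name, job, dept, skills) in non_1nf
    for skill in skills.split(",")
]
```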
Second Normal Form (2NF)
All attributes that are not part of the primary key must be fully dependent on the primary key
Each non-key attribute must be functionally dependent on the whole key
The table must already be in 1NF
2NF ?
employee(employeeID,name,job,departmentID,skill)
employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
Functional dependencies
employeeID, skill → name, job, departmentID
employeeID → name, job, departmentID
name, job and departmentID are only partially functionally dependent on the primary key (employeeID, skill)
They are not fully functionally dependent on the primary key
2NF
Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key
Break the table into two tables:
employee(employeeID,name,job,departmentID)
employeeSkills(employeeID,skill)
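A sketch of this 2NF decomposition in Python: project the 1NF rows onto the two smaller relations, which removes the partial dependency. The data is taken from the slides.

```python
# The 1NF table from the slides.
rows = [
    (7513, "Nora", "Programmer", 128, "C"),
    (7513, "Nora", "Programmer", 128, "Perl"),
    (7513, "Nora", "Programmer", 128, "Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
    (6651, "Alex", "Programmer", 128, "VB"),
    (6651, "Alex", "Programmer", 128, "Java"),
    (9006, "Claudia", "System-Administrator", 128, "NT"),
    (9006, "Claudia", "System-Administrator", 128, "Linux"),
]

# Projection onto employee(employeeID, name, job, departmentID):
# sets remove the duplicate rows that the redundancy produced.
employee = sorted({(eid, name, job, dept) for (eid, name, job, dept, _) in rows})

# Projection onto employeeSkills(employeeID, skill).
employee_skills = sorted({(eid, skill) for (eid, _, _, _, skill) in rows})
```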
Third Normal Form (3NF)
Remove all transitive dependencies
The table must be in 2NF
employee(employeeID,name,job,departmentID,departmentName)
employeeID → name, job, departmentID, departmentName
departmentID → departmentName
employeeID  name   job         departmentID  departmentName
7513        Nora   Programmer  128           Research
9842        Ben    DBA         42            Finance
6651        Ajay   Programmer  128           Research
9006        Candy  SYS         128           Research
Transitive dependency
employeeID → departmentName, because
employeeID → departmentID
departmentID → departmentName
3NF
Remove the transitive dependency
Decompose into multiple tables:
employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)
3NF
A table is in 3NF if, for every functional dependency, either:
The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)
or
The right side of the functional dependency is part of some key of the table
BCNF
Boyce-Codd Normal Form drops the second alternative: all attributes must be functionally determined by a superkey
Full normalization means lots of logically separate relations
Lots of logically separate relations means a lot of physically separate files
Lots of physically separate files means a lot of I/O
Difficulties in finding dimensions for dimensional schema, star schema (dimension tables, fact table)
What is Denormalization?
Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn
This reduces redundancy; each projection R1, R2, ..., Rn is at the highest possible level of normalization
Denormalizing the relational variables means replacing them by their join R
This increases redundancy, by ensuring that R is at a lower level of normalization than R1, R2, ..., Rn
Problems
Once we start to denormalize, it is not clear when to stop
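A sketch in Python of the central claim: the normalized projections join back to the original (denormalized) relation R. The data follows the employee/department example from the slides.

```python
# The two projections produced by normalization.
employee = {
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
    (6651, "Alex", "Programmer", 128),
    (9006, "Claudia", "System-Administrator", 128),
}
department = {
    (128, "Research and Development"),
    (42, "Finance"),
}

# Natural join on departmentID. Denormalization replaces the two
# projections by this single relation, which repeats the department
# name on every matching employee row.
joined = {
    (eid, name, job, dept, dname)
    for (eid, name, job, dept) in employee
    for (did, dname) in department
    if did == dept
}
```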
Dimensional Schema
Array cells are often empty
The more dimensions, the more empty cells
Empty cell = missing information
How do we treat information that is not present? How does the system support:
• information that is unknown
• information that has not been captured
• information that is not applicable
• ...
Solution?
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=" "
noisy: containing errors or outliers
• e.g., Salary="-10"
inconsistent: containing discrepancies in codes or names
• e.g., Age="42", Birthday="03/07/1997"
• e.g., rating was "1, 2, 3", now rating is "A, B, C"
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results! Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume that produces the same or similar analytical results
Data discretization
Part of data reduction, but of particular importance, especially for numerical data
Forms of Data Preprocessing
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistency with other recorded data, leading to deletion
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred
Missing Values
The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
null corresponds to UNK, for unknown
Is 3VL a mistake?
Boolean Operators
A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value
AND | t  u  f
 t  | t  u  f
 u  | u  u  f
 f  | f  f  f

OR  | t  u  f
 t  | t  t  t
 u  | t  u  u
 f  | t  u  f

NOT
 t  | f
 u  | u
 f  | t
MAYBE
Another important Boolean operator is MAYBE
MAYBE
 t  | f
 u  | t
 f  | f
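The four truth tables can be modelled in Python, taking None for the unknown truth value. This is only a sketch of the logic, not SQL's actual implementation.

```python
# 3VL connectives, with None standing in for the unknown truth value u.
def and3(p, q):
    if p is False or q is False:   # false dominates AND
        return False
    if p is None or q is None:     # otherwise unknown propagates
        return None
    return True

def or3(p, q):
    if p is True or q is True:     # true dominates OR
        return True
    if p is None or q is None:     # otherwise unknown propagates
        return None
    return False

def not3(p):
    return None if p is None else not p

def maybe(p):
    # MAYBE is true exactly when its operand is unknown.
    return p is None

# p OR NOT(p) is not necessarily true in 3VL:
assert or3(None, not3(None)) is None
```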
Example
Consider the query "Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with a salary less than €40,000":

EMP WHERE MAYBE ( JOB = 'PROGRAMMER' AND
                  DOB < DATE ('1971-1-18') AND
                  SALARY < 40000 )
Without MAYBE, we could instead assume the existence of another operator called IS_UKN, which takes a single scalar operand and returns true if the operand evaluates to UNK and false otherwise:

EMP WHERE ( JOB = 'PROGRAMMER' OR IS_UKN (JOB) )
      AND ( DOB < DATE ('1971-1-18') OR IS_UKN (DOB) )
      AND ( SALARY < 40000 OR IS_UKN (SALARY) )
      AND NOT ( JOB = 'PROGRAMMER' AND DOB < DATE ('1971-1-18') AND SALARY < 40000 )
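A sketch in Python showing that the MAYBE form and the IS_UKN expansion select the same rows, with None standing in for UNK. The sample EMP rows are assumptions for illustration, not from the slides.

```python
import datetime

# Hypothetical EMP rows; None means UNK.
EMP = [
    {"name": "A", "job": "PROGRAMMER", "dob": datetime.date(1965, 5, 2), "salary": None},
    {"name": "B", "job": "PROGRAMMER", "dob": datetime.date(1965, 5, 2), "salary": 30000},
    {"name": "C", "job": "DBA", "dob": None, "salary": 30000},
]
CUTOFF = datetime.date(1971, 1, 18)

def eq3(a, b):   # scalar comparison yielding a 3VL truth value
    return None if a is None or b is None else a == b

def lt3(a, b):
    return None if a is None or b is None else a < b

def and3(p, q):
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def cond(e):     # JOB = 'PROGRAMMER' AND DOB < cutoff AND SALARY < 40000
    return and3(and3(eq3(e["job"], "PROGRAMMER"), lt3(e["dob"], CUTOFF)),
                lt3(e["salary"], 40000))

# MAYBE form: rows where the whole condition evaluates to unknown.
maybe_rows = [e["name"] for e in EMP if cond(e) is None]

# Expanded IS_UKN form: every conjunct true-or-unknown,
# but not all three definitely true.
def expanded(e):
    c1 = eq3(e["job"], "PROGRAMMER")
    c2 = lt3(e["dob"], CUTOFF)
    c3 = lt3(e["salary"], 40000)
    weak = all(c is True or c is None for c in (c1, c2, c3))
    return weak and not (c1 is True and c2 is True and c3 is True)

expanded_rows = [e["name"] for e in EMP if expanded(e)]
assert maybe_rows == expanded_rows
```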
Numeric expressions
WEIGHT * 454
If WEIGHT is UNK, then the result is also UNK
Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK
Anomalies
WEIGHT - WEIGHT = UNK (not 0)
WEIGHT / 0 = UNK (not a "zero divide" error)
UNK is not u (unk)
UNK (the value-unknown null) and u (the unknown truth value) are not the same thing
u is a value; UNK is not a value at all!
Suppose X is of type BOOLEAN, which has three values: t (true), f (false), u (unk)
"X is u" means X is known to be the truth value unknown
"X is UNK" means X is not known at all!
Some 3VL Consequences
The comparison x = x does not give true
In 3VL, x is not equal to itself; the comparison happens to be UNK
The Boolean expression p OR NOT(p) does not necessarily give true
unk OR NOT(unk) = unk
Example
Get all suppliers in Porto and take the union with all suppliers not in Porto
We do not get all suppliers!
We need to add: suppliers maybe in Porto
In 2VL, p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL
While the two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; it contains only knowledge about the real world
Some 3VL Consequences
The expression r JOIN r does not necessarily give r
A = B and B = C together do not imply A = C
...
Many equivalences that are valid in 2VL break down in 3VL
We will get wrong answers
Special Values
Drop the idea of null, UNK, unk, and 3VL
Use special values instead to represent missing information
Special values are used in the real world
In the real world we might use the special value "?" to denote the hours worked by a certain employee if the actual value is unknown
Special Values General Idea:
Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
The type of the applicable attribute is then not just, say, integers, but integers plus whatever the special value is
The approach is not very elegant, but it avoids the 3VL problems, because it stays in 2VL
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or a decision tree
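A sketch of the automatic fill-in strategies in plain Python, using None for missing values. The sample values and class labels are assumptions for illustration.

```python
# Hypothetical attribute values (None = missing) and class labels.
values = [30, None, 50, None, 40]
labels = ["a", "a", "b", "b", "a"]

# 1. Global constant: replace every missing value with "unknown".
const_filled = [v if v is not None else "unknown" for v in values]

# 2. Attribute mean over all non-missing values.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)          # (30 + 50 + 40) / 3 = 40.0
mean_filled = [v if v is not None else mean for v in values]

# 3. Class-conditional mean: the smarter variant, averaging only
#    within the sample's own class.
def class_mean(cls):
    vs = [v for v, c in zip(values, labels) if c == cls and v is not None]
    return sum(vs) / len(vs)

class_filled = [v if v is not None else class_mean(c)
                for v, c in zip(values, labels)]
```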
Relational Databases - Normalization Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with the 3VL approach
Special Values
Next..
Data Preprocessing
Visual inspection Noise Reduction Data Reduction Data Discretization Data Integration