Data Preprocessing

By S. Dinesh Babu, II MCA

Definition

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

Data in the real world is dirty

Measures for data quality: A multidimensional view

◦ Accuracy: are the data correct or wrong, accurate or not?

◦ Completeness: are values not recorded, unavailable, …

◦ Consistency: some values modified but others not, dangling, …

◦ Timeliness: are the data updated in a timely way?

◦ Believability: how much are the data trusted to be correct?

◦ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Data Cleaning: Incomplete Data

Data is not always available.

Ex: Age = " "

Missing data may be due to:

◦ equipment malfunction

◦ data that was inconsistent with other recorded data and thus deleted

◦ data not entered due to misunderstanding

◦ data not considered important at the time of entry
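As an illustration of spotting incomplete data such as the blank Age value above, here is a minimal Python sketch. The record layout, the `is_missing` helper, and the choice to treat `None` and blank strings as missing are all assumptions made for this example, not part of the original slides.

```python
# Hypothetical records; None and blank strings both stand for "not recorded".
records = [
    {"name": "A", "age": "34"},
    {"name": "B", "age": " "},
    {"name": "C", "age": None},
    {"name": "D", "age": "41"},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing (an assumption)."""
    return value is None or str(value).strip() == ""


def missing_report(rows, attribute):
    """Return the indexes of the rows whose value for `attribute` is missing."""
    return [i for i, row in enumerate(rows) if is_missing(row.get(attribute))]


missing_age = missing_report(records, "age")
print("Rows with missing age:", missing_age)
```

Once the incomplete rows are known, they can be reviewed, filled in, or excluded, depending on why the value is missing.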

Noisy Data

Data that cannot be understood or interpreted correctly, such as unstructured data.

Unnecessarily increases the amount of storage space required.

Causes:

◦ Hardware failure

◦ Programming errors

Data Cleaning as a Process

Missing values, noise, and inconsistencies contribute to inaccurate data.

The first step in data cleaning as a process is discrepancy detection.

Discrepancies can be caused by several factors, including:

◦ poorly designed data entry forms

◦ human error in data entry

The data should also be examined against the following rules:

◦ Unique rule: each value of the given attribute must be different from all other values of that attribute.

◦ Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.

◦ Null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate a null (missing) value.
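The three rules can be checked mechanically. The following Python sketch is one possible way to do it; the function names, the `NULL_MARKERS` set, and the sample invoice numbers are hypothetical, and the consecutive check assumes a non-empty list of integer values.

```python
# Assumed null rule for this example: these strings mark a missing value.
NULL_MARKERS = {"", "?", "N/A"}


def violates_unique_rule(values):
    """Unique rule: every value of the attribute must differ from all others."""
    return len(set(values)) != len(values)


def violates_consecutive_rule(values):
    """Consecutive rule: no gaps between the lowest and highest value.

    Assumes a non-empty list of integers.
    """
    present = set(values)
    return any(v not in present for v in range(min(values), max(values) + 1))


def count_nulls(values):
    """Null rule: count values that are None or match an assumed null marker."""
    return sum(1 for v in values if v is None or str(v).strip() in NULL_MARKERS)


invoice_ids = [101, 102, 104, 104]  # 103 is absent and 104 is repeated
ages = ["34", "?", " ", None, "N/A"]

print(violates_unique_rule(invoice_ids))
print(violates_consecutive_rule(invoice_ids))
print(count_nulls(ages))
```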

Data Integration

Data integration is the merging of data from multiple data stores.

Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.

This improves the accuracy and speed of the subsequent data mining process.
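A small Python sketch of merging two data stores while flagging inconsistencies is given below. The two stores (`crm`, `billing`), their fields, and the policy that the primary store wins on conflict are assumptions chosen for illustration only.

```python
# Two hypothetical data stores keyed by customer id.
crm = {
    1: {"name": "Asha", "city": "Chennai"},
    2: {"name": "Ravi", "city": "Madurai"},
}
billing = {
    2: {"name": "Ravi", "city": "Salem"},
    3: {"name": "Meena", "city": "Trichy"},
}


def integrate(primary, secondary):
    """Merge two keyed stores into one and list the conflicting fields."""
    merged = {}
    conflicts = []
    for key in sorted(set(primary) | set(secondary)):
        a = primary.get(key, {})
        b = secondary.get(key, {})
        # Record every field on which the two stores disagree.
        for field in sorted(set(a) & set(b)):
            if a[field] != b[field]:
                conflicts.append((key, field, a[field], b[field]))
        # One record per key removes redundancy; the primary store wins (assumed policy).
        merged[key] = {**b, **a}
    return merged, conflicts


merged, conflicts = integrate(crm, billing)
print(merged)
print(conflicts)
```

Customer 2 appears in both stores, so it is stored once in the merged result, and the disagreement over its city is reported rather than silently discarded.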

Data Reduction

The goal is to obtain a reduced representation of the data set that is much smaller in volume.

Strategies for data reduction include the following:

◦ Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.

◦ Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

◦ Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.

◦ Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations, such as parametric models or nonparametric methods (clustering, sampling, and the use of histograms).
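Two of these strategies are easy to sketch in Python: aggregation (rolling quarterly figures up to yearly totals) and numerosity reduction by simple random sampling without replacement. The sales figures, function names, and fixed random seed below are invented for the example.

```python
import random
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 200), (2022, "Q2", 300), (2022, "Q3", 250), (2022, "Q4", 350),
    (2023, "Q1", 220), (2023, "Q2", 310), (2023, "Q3", 270), (2023, "Q4", 400),
]


def aggregate_by_year(rows):
    """Aggregation: replace the quarterly rows with one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


def sample_without_replacement(rows, n, seed=0):
    """Numerosity reduction: keep a simple random sample of n rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(rows, n)


annual_sales = aggregate_by_year(quarterly_sales)
sample = sample_without_replacement(quarterly_sales, 3)
print(annual_sales)
print(sample)
```

Eight quarterly rows shrink to two yearly totals in the first case and to three sampled rows in the second; which reduction is acceptable depends on the analysis that follows.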

Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:

◦ Smoothing: remove noise from the data

◦ Aggregation: summarization, data cube construction

◦ Generalization: concept hierarchy climbing

◦ Normalization: values are scaled to fall within a small, specified range, e.g., by min-max normalization
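Min-max normalization maps a value v of an attribute with observed minimum min and maximum max to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal Python sketch follows; the function name, the default target range of [0, 1], and the sample income values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound (assumed convention).
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 73600, 98000]  # hypothetical attribute values
normalized = min_max_normalize(incomes)
print([round(x, 3) for x in normalized])
```

The smallest value maps to the lower bound and the largest to the upper bound, so the result depends on the observed minimum and maximum of the attribute.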

Data Discretization

Discretization: divide the range of a continuous attribute into intervals

◦ Interval labels can then be used to replace actual data values

◦ Discretization reduces data size

◦ It can be done by splitting (top-down) or by merging (bottom-up)

◦ Discretization can be performed recursively on an attribute

◦ It prepares the data for further analysis, e.g., classification
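One simple way to divide a continuous attribute into intervals is equal-width binning, sketched below in Python. This is only one of several discretization techniques; the function name, the label format, and the sample ages are assumptions made for the example.

```python
def equal_width_discretize(values, k):
    """Split the range of `values` into k equal-width intervals.

    Returns (indexes, labels): the interval index and interval label
    that replace each original value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    indexes = []
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the last interval.
        index = min(int((v - lo) / width), k - 1) if width else 0
        indexes.append(index)
    labels = [f"{edges[i]:g}-{edges[i + 1]:g}" for i in indexes]
    return indexes, labels


ages = [13, 15, 22, 25, 33, 35, 40, 45, 52, 70]  # hypothetical values
indexes, labels = equal_width_discretize(ages, 3)
print(list(zip(ages, labels)))
```

Ten distinct ages are replaced by three interval labels, which shows how discretization reduces the number of distinct values before further analysis.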

Three types of attributes

◦ Nominal: values from an unordered set, e.g., color, profession

◦ Ordinal: values from an ordered set, e.g., military or academic rank

◦ Numeric: quantitative values, e.g., integer or real numbers

Thank You
