Chapter 2, Part 3


TRANSCRIPT

  • Slide 1/24

    Ch2: Data Preprocessing (Part 3)

    Amit Kr Upadhyay

    Sharda University

  • Slide 2/24

    Knowledge Discovery (KDD) Process

    Data mining is the core of the knowledge discovery process.

    [Figure: KDD process: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation]

  • Slide 3/24

    Forms of Data Preprocessing

  • Slide 4/24

    Data Transformation

    In data transformation, the data are transformed or consolidated into forms appropriate for mining.

  • Slide 5/24

    Data Transformation

    Data transformation can involve the following:

    Smoothing: remove noise from the data, using techniques such as binning, regression, and clustering

    Aggregation

    Generalization

    Normalization

    Attribute construction
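
    Smoothing by binning is easy to sketch in Python. This is my own minimal illustration (not from the slides): the values are sorted into equal-frequency bins and each value is replaced by its bin mean; the data and bin count are invented for the example.

    # Smoothing by bin means: equal-frequency (equal-depth) binning,
    # then every value in a bin is replaced by the bin's mean.
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # example values (made up)
    n_bins = 3
    bin_size = len(data) // n_bins

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))

    print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]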

  • Slide 6/24

    Normalization

    Min-max normalization

    Z-score normalization

    Decimal normalization

  • Slide 7/24

    Min-max normalization

    Min-max normalization maps a value v of attribute A onto the range [new_minA, new_maxA]:

    v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

    Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

    ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
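
    A minimal Python sketch of the min-max formula above (my own illustration), using the slide's income example as a check:

    def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
        """Map v from [min_a, max_a] onto [new_min, new_max]."""
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    # Slide example: income in [12,000, 98,000] rescaled to [0.0, 1.0]
    print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716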

  • Slide 8/24

    Z-score normalization

    Z-score normalization standardizes a value v of attribute A using the mean μA and standard deviation σA:

    v' = (v - μA) / σA

    Ex. Let the mean and standard deviation of income be $54,000 and $16,000. Then $73,600 is mapped to

    (73,600 - 54,000) / 16,000 = 1.225
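
    The same example as a quick Python check, assuming the attribute's mean and standard deviation are already known (here the slide's $54,000 and $16,000):

    def z_score_normalize(v, mean_a, std_a):
        """Standardize v using the attribute's mean and standard deviation."""
        return (v - mean_a) / std_a

    # Slide example: income of 73,600 with mean 54,000 and standard deviation 16,000
    print(round(z_score_normalize(73_600, 54_000, 16_000), 3))  # 1.225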

  • Slide 9/24

    Decimal normalization

    Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

    Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3: -986 normalizes to -0.986 and 917 to 0.917.
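
    A small sketch of decimal scaling in Python (my own illustration): j is increased until every scaled value has absolute value below 1.

    def decimal_scale(values):
        """Normalize by decimal scaling: v' = v / 10**j, where j is the
        smallest integer such that max(|v'|) < 1."""
        j = 0
        while max(abs(v) for v in values) / 10 ** j >= 1:
            j += 1
        return [v / 10 ** j for v in values], j

    scaled, j = decimal_scale([-986, 917])
    print(j, scaled)  # 3 [-0.986, 0.917]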

  • Slide 10/24

    Data Reduction

    Why data reduction?

    A database or data warehouse may store terabytes of data

    Complex data analysis/mining may take a very long time to run on the complete data set

  • Slide 11/24

    Data Reduction

    Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

  • Slide 12/24

    Data Reduction

    Data reduction strategies:

    Data cube aggregation

    Attribute subset selection

    Dimensionality reduction, e.g., remove unimportant attributes

    Numerosity reduction, e.g., fit data into models

    Discretization and concept hierarchy generation

  • Slide 13/24

    Data cube aggregation

  • Slide 14/24

    Data cube aggregation

    Multiple levels of aggregation in data cubes

    Further reduce the size of the data to deal with

    Reference appropriate levels

    Use the smallest representation that is enough to solve the task
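
    As an illustration only (the slides give no code), the pandas sketch below rolls hypothetical daily sales up to a coarser branch-per-quarter level of the cube; the column names and values are invented.

    import pandas as pd

    # Hypothetical detailed data: one row per (date, branch) with a sales amount.
    sales = pd.DataFrame({
        "date": pd.to_datetime(["2009-01-15", "2009-02-10", "2009-04-03", "2009-05-20"]),
        "branch": ["A", "A", "B", "B"],
        "amount": [400, 350, 500, 620],
    })

    # Aggregate to a higher cube level: total sales per branch per quarter.
    quarterly = (
        sales.assign(quarter=sales["date"].dt.to_period("Q"))
             .groupby(["branch", "quarter"], as_index=False)["amount"]
             .sum()
    )
    print(quarterly)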

  • Slide 15/24

    Attribute subset selection

    Dimensionality reduction via feature selection (i.e., attribute subset selection):

    Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features

    This reduces the number of patterns in the mining results, making them easier to understand

  • Slide 16/24

    Attribute subset selection

    Dimensionality reduction: heuristic methods (due to the exponential number of choices):

    Step-wise forward selection (see the sketch below)

    Step-wise backward elimination

    Combining forward selection and backward elimination

    Decision-tree induction
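
    A rough sketch of step-wise forward selection (my own illustration, not the authors' code): starting from the empty set, the attribute that most improves a caller-supplied evaluate() score is added at each step until no attribute helps. The evaluate function is a placeholder for a real model evaluation such as cross-validated accuracy.

    def forward_select(attributes, evaluate):
        """Greedy step-wise forward selection.

        attributes: list of candidate attribute names.
        evaluate:   callable scoring a subset of attributes (higher is better);
                    a placeholder for whatever quality measure is used.
        """
        selected = []
        best_score = evaluate(selected)
        improved = True
        while improved:
            improved = False
            for attr in (a for a in attributes if a not in selected):
                score = evaluate(selected + [attr])
                if score > best_score:
                    best_score, best_attr, improved = score, attr, True
            if improved:
                selected.append(best_attr)
        return selected

    Step-wise backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.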

  • Slide 17/24

    Attribute subset selection

    Dimensionality reduction example. Initial attribute set: {A1, A2, A3, A4, A5, A6}

    [Figure: decision-tree induction splits first on A4, then on A1 and A6, ending in Class1/Class2 leaves]

    > Reduced attribute set: {A1, A4, A6}

  • Slide 18/24

    Numerosity reduction

    Reduce data volume by choosing alternative, smaller forms of data representation

    Major families: histograms, clustering, sampling

  • Slide 19/24

    Data Reduction Method: Histograms

    [Figure: example histogram of prices from 10,000 to 100,000, with bucket counts between 0 and 40]

  • Slide 20/24

    Data Reduction Method: Histograms

    Divide the data into buckets and store the average (or sum) for each bucket

    Partitioning rules:

    Equal-width: equal bucket range

    Equal-frequency (or equal-depth): roughly equal number of values per bucket

    V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that each bucket represents)

    MaxDiff: set bucket boundaries between the pairs of adjacent values having the β-1 largest differences
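
    To make the two simplest rules concrete, here is a small numpy sketch of equal-width and equal-frequency bucketing (illustrative only; the price values are invented):

    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                       12, 14, 14, 15, 15, 18, 21, 25, 28, 30])

    # Equal-width: four buckets covering ranges of equal size.
    counts, edges = np.histogram(prices, bins=4)
    print("equal-width edges:", edges, "counts:", counts)

    # Equal-frequency (equal-depth): four buckets with roughly the same
    # number of values in each.
    depth_buckets = np.array_split(np.sort(prices), 4)
    print("equal-depth buckets:", [b.tolist() for b in depth_buckets])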

  • Slide 21/24

    Data Reduction Method: Clustering

    Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

    There are many choices of clustering definitions and clustering algorithms

    Cluster analysis will be studied in depth in Chapter 7
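
    A sketch of clustering-based reduction, assuming scikit-learn is available: the full data set is replaced by one (centroid, approximate diameter) summary per cluster.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))        # stand-in for a large data set

    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)

    # Reduced representation: centroid plus a rough diameter per cluster
    # (twice the largest member-to-centroid distance).
    summary = []
    for k, center in enumerate(kmeans.cluster_centers_):
        members = data[kmeans.labels_ == k]
        diameter = 2 * np.linalg.norm(members - center, axis=1).max()
        summary.append((center, diameter))

    print(len(summary), "cluster summaries instead of", len(data), "tuples")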

  • Slide 22/24

    Data Reduction Method: Sampling

    Sampling: obtaining a small sample s to represent the whole data set N

    Simple random sample without replacement

    Simple random sample with replacement

    Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, a simple random sample of s clusters can be obtained, where s < M

    Stratified sample
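
    The four schemes sketched with numpy on a made-up set of tuple IDs; the cluster assignments and stratum labels are invented for the example.

    import numpy as np

    rng = np.random.default_rng(42)
    N = np.arange(1000)                                   # stand-in for the tuples in D
    strata = np.repeat(["low", "mid", "high"], [500, 300, 200])

    srswor = rng.choice(N, size=50, replace=False)        # simple random sample w/o replacement
    srswr = rng.choice(N, size=50, replace=True)          # simple random sample with replacement

    # Cluster sample: pick s of the M disjoint clusters and keep every tuple in them.
    M = 20
    clusters = N % M                                      # invented cluster assignment
    chosen = rng.choice(M, size=4, replace=False)
    cluster_sample = N[np.isin(clusters, chosen)]

    # Stratified sample: draw from each stratum in proportion to its size.
    stratified = np.concatenate([
        rng.choice(N[strata == s], size=int(0.05 * (strata == s).sum()), replace=False)
        for s in np.unique(strata)
    ])
    print(len(srswor), len(srswr), len(cluster_sample), len(stratified))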

  • Slide 23/24

    Sampling: with or without replacement

    [Figure: a simple random sample drawn from the raw data, with and without replacement]

  • Slide 24/24

    Sampling: Cluster or Stratified Sampling

    [Figure: raw data on the left; cluster/stratified sample on the right]