Chapter 2 Part 3
Posted on 09-Apr-2018
-
8/8/2019 Chapter 2 Part3
1/24
Ch2 Data Preprocessing part3
Amit Kr Upadhyay
Sharda University
-
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: the KDD process — Databases and Data Warehouse with Data Cleaning and Data Integration; Selection of task-relevant data; Data Mining; Pattern Evaluation]
-
Forms of Data Preprocessing
-
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
-
Data Transformation
Data transformation can involve the following:
Smoothing: removes noise from the data; techniques include binning, regression, and clustering
Aggregation
Generalization
Normalization
Attribute construction
-
Normalization
Min-max normalization
Z-score normalization
Normalization by decimal scaling
-
Min-max normalization
Min-max normalization maps a value v of attribute A to the range [new_min_A, new_max_A]:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to:

((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
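A minimal Python sketch of this mapping, using the slide's income example (the function name is mine, not from the slides):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example: $73,600 in [$12,000, $98,000] normalized to [0.0, 1.0]
print(round(min_max_normalize(73600, 12000, 98000), 3))  # -> 0.716
```

Note that min-max normalization preserves the relationships among the original values but will encounter an "out of bounds" value if a future input falls outside the original [min_A, max_A] range.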
-
Z-score normalization
Z-score normalization maps a value v of attribute A using the mean μ_A and standard deviation σ_A:

v' = (v - μ_A) / σ_A

Ex. With mean $54,000 and standard deviation $16,000, $73,600 is mapped to:

(73,600 - 54,000) / 16,000 = 1.225
-
Normalization by decimal scaling

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Suppose the recorded values of A range from -986 to 917; the maximum absolute value is 986, so j = 3 (each value is divided by 1,000: -986 normalizes to -0.986 and 917 to 0.917).
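A small sketch of the decimal-scaling rule, using the slide's range (the helper is my own naming):

```python
def decimal_scale(values):
    # Find the smallest j such that max(|v / 10**j|) < 1, then divide every value.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Slide example: values range from -986 to 917, max absolute value 986 -> j = 3
scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # -> 3 [-0.986, 0.917]
```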
-
Data Reduction
Why data reduction?
A database or data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
-
Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
-
Data Reduction
Data reduction strategies:
Data cube aggregation
Attribute subset selection
Dimensionality reduction, e.g., remove unimportant attributes
Numerosity reduction, e.g., fit data into models
Discretization and concept hierarchy generation
-
Data cube aggregation
-
Data cube aggregation
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
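A toy roll-up illustrating the idea: if the task only needs yearly totals, the smaller year-level representation suffices in place of the quarterly records (the sales figures are invented for illustration):

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount)
sales = [(2018, "Q1", 224), (2018, "Q2", 408), (2018, "Q3", 350), (2018, "Q4", 586),
         (2019, "Q1", 310), (2019, "Q2", 402)]

# Roll up from the quarter level to the year level of the cube
yearly = defaultdict(int)
for year, _, amount in sales:
    yearly[year] += amount

print(dict(yearly))  # -> {2018: 1568, 2019: 712}
```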
-
Attribute subset selection
Dimensionality reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
Reduces the number of patterns, which are then easier to understand
-
Attribute subset selection
Dimensionality reduction
Heuristic methods (due to the exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
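Step-wise forward selection can be sketched as a greedy loop; this is a minimal illustration, not a full wrapper method, and the relevance scores are invented stand-ins for a real subset-evaluation measure:

```python
def forward_select(attributes, score, k):
    """Greedy step-wise forward selection: start from the empty set and, at
    each step, add the attribute that most improves the subset score,
    stopping once k attributes are selected."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy subset score (an assumption for illustration): sum of per-attribute relevances
relevance = {"A1": 0.8, "A2": 0.1, "A3": 0.2, "A4": 0.9, "A5": 0.05, "A6": 0.7}
score = lambda subset: sum(relevance[a] for a in subset)

print(forward_select(["A1", "A2", "A3", "A4", "A5", "A6"], score, 3))
# -> ['A4', 'A1', 'A6']
```

With these toy scores the procedure recovers the reduced set {A1, A4, A6} shown on the next slide; step-wise backward elimination is the mirror image, starting from the full set and repeatedly dropping the worst attribute.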
-
Attribute subset selection
Dimensionality reduction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision-tree induction — internal nodes test A4, A1, and A6; leaves are Class1 and Class2]
> Reduced attribute set: {A1, A4, A6}
-
Numerosity reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Major families: histograms, clustering, sampling
-
Data Reduction Method: Histograms
[Figure: example histogram — counts (0 to 40) on the y-axis over value buckets from 10,000 to 100,000]
-
Data Reduction Method: Histograms
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): equal number of values per bucket
V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents)
MaxDiff: set bucket boundaries between pairs of adjacent values having the β-1 largest differences, for β buckets
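The equal-width rule above can be sketched directly; the price list is toy data, and real histograms would store only the per-bucket summaries, not the raw values:

```python
def equal_width_histogram(values, n_buckets):
    # Partition the value range into n_buckets equal-width buckets and
    # store only [bucket_start, bucket_end, count] per bucket.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[lo + i * width, lo + (i + 1) * width, 0] for i in range(n_buckets)]
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the max value into the last bucket
        buckets[i][2] += 1
    return buckets

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 15, 18, 20, 21, 21, 25]
for start, end, count in equal_width_histogram(prices, 4):
    print(f"[{start:.0f}, {end:.0f}): {count}")
```

Equal-frequency partitioning would instead sort the values and cut every len(values)/n_buckets items, so skewed data does not leave some buckets nearly empty.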
-
Data Reduction Method:
Data Reduction Method: Clustering
Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
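Given a cluster (found by any clustering algorithm), the stored representation can be as small as a centroid and a diameter; a minimal sketch with made-up 2-D points:

```python
import math

def cluster_summary(points):
    # Replace the raw cluster members with a compact representation:
    # centroid (coordinate-wise mean) and diameter (max pairwise distance).
    n = len(points)
    centroid = tuple(sum(coord) / n for coord in zip(*points))
    diameter = max(math.dist(p, q) for p in points for q in points)
    return centroid, diameter

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
centroid, diameter = cluster_summary(cluster)
print(centroid, diameter)  # centroid (1.5, 1.5); diameter is sqrt(2)
```

This reduction is effective when the data really are clustered, and lossy when they are smeared: four floats now stand in for the whole cluster.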
-
Data Reduction Method:
Sampling Sampling: obtaining a small sample s to
represent the whole data set N
Simple random sample without replacement
Simple random sample with replacement
Cluster sample: if the tuples in D are groupedinto M mutually disjoint clusters, then an SimpleRandom Sample can be obtained, where s < M
Stratified sample
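The sampling variants can be sketched with the standard library; the data set and the even/odd strata are toy stand-ins:

```python
import random

data = list(range(100))   # toy stand-in for the whole data set N
rng = random.Random(42)   # seeded for reproducibility

# Simple random sample WITHOUT replacement: each tuple is drawn at most once.
srswor = rng.sample(data, 10)

# Simple random sample WITH replacement: a tuple may be drawn more than once.
srswr = [rng.choice(data) for _ in range(10)]

# Stratified sample: draw within each stratum so that every group
# stays represented even when the data are skewed.
strata = {"even": [x for x in data if x % 2 == 0],
          "odd":  [x for x in data if x % 2 == 1]}
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

print(len(srswor), len(srswr), len(stratified))
```

A cluster sample would apply the same `rng.sample` step to the M clusters themselves rather than to individual tuples, then keep every tuple in the chosen clusters.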
-
Sampling: With or Without Replacement
[Figure: raw data sampled with and without replacement]
-
Sampling: Cluster or Stratified Sampling
[Figure: raw data versus the cluster/stratified sample]