Chapter 2, Part 3


TRANSCRIPT

  • Slide 1/24

    Ch2: Data Preprocessing (Part 3)

    Amit Kr Upadhyay

    Sharda University

  • Slide 2/24

    Knowledge Discovery (KDD) Process

    Data mining is the core of the knowledge discovery process.

    [Figure: KDD process: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation]

  • Slide 3/24

    Forms of Data Preprocessing

  • Slide 4/24

    Data Transformation

    In data transformation, the data are transformed or consolidated into forms appropriate for mining.

  • Slide 5/24

    Data Transformation

    Data transformation can involve the following:

    Smoothing: remove noise from the data, using techniques such as binning, regression, and clustering

    Aggregation

    Generalization

    Normalization

    Attribute construction
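
    Smoothing by binning is easy to sketch in Python. This is my own minimal illustration (not from the slides): the values are sorted into equal-frequency bins and each value is replaced by its bin mean; the data and bin count are invented for the example.

    # Smoothing by bin means: equal-frequency (equal-depth) binning,
    # then every value in a bin is replaced by the bin's mean.
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # example values (made up)
    n_bins = 3
    bin_size = len(data) // n_bins

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))

    print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]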

  • Slide 6/24

    Normalization

    Min-max normalization

    Z-score normalization

    Decimal normalization

  • Slide 7/24

    Min-max normalization

    Min-max normalization maps a value v of attribute A onto the range [new_minA, new_maxA]:

    v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

    Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

    ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
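
    A minimal Python sketch of the min-max formula above (my own illustration), using the slide's income example as a check:

    def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
        """Map v from [min_a, max_a] onto [new_min, new_max]."""
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    # Slide example: income in [12,000, 98,000] rescaled to [0.0, 1.0]
    print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716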

  • Slide 8/24

    Z-score normalization

    Z-score normalization standardizes a value v of attribute A using the mean μA and standard deviation σA:

    v' = (v - μA) / σA

    Ex. Let the mean and standard deviation of income be $54,000 and $16,000. Then $73,600 is mapped to

    (73,600 - 54,000) / 16,000 = 1.225
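
    The same example as a quick Python check, assuming the attribute's mean and standard deviation are already known (here the slide's $54,000 and $16,000):

    def z_score_normalize(v, mean_a, std_a):
        """Standardize v using the attribute's mean and standard deviation."""
        return (v - mean_a) / std_a

    # Slide example: income of 73,600 with mean 54,000 and standard deviation 16,000
    print(round(z_score_normalize(73_600, 54_000, 16_000), 3))  # 1.225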

  • Slide 9/24

    Decimal normalization

    Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

    Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3: -986 normalizes to -0.986 and 917 to 0.917.
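
    A small sketch of decimal scaling in Python (my own illustration): j is increased until every scaled value has absolute value below 1.

    def decimal_scale(values):
        """Normalize by decimal scaling: v' = v / 10**j, where j is the
        smallest integer such that max(|v'|) < 1."""
        j = 0
        while max(abs(v) for v in values) / 10 ** j >= 1:
            j += 1
        return [v / 10 ** j for v in values], j

    scaled, j = decimal_scale([-986, 917])
    print(j, scaled)  # 3 [-0.986, 0.917]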

  • Slide 10/24

    Data Reduction

    Why data reduction?

    A database or data warehouse may store terabytes of data

    Complex data analysis/mining may take a very long time to run on the complete data set

  • Slide 11/24

    Data Reduction

    Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

  • Slide 12/24

    Data Reduction

    Data reduction strategies:

    Data cube aggregation

    Attribute subset selection

    Dimensionality reduction, e.g., remove unimportant attributes

    Numerosity reduction, e.g., fit data into models

    Discretization and concept hierarchy generation

  • Slide 13/24

    Data cube aggregation

  • Slide 14/24

    Data cube aggregation

    Multiple levels of aggregation in data cubes

    Further reduce the size of the data to deal with

    Reference appropriate levels

    Use the smallest representation that is enough to solve the task
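
    As an illustration only (the slides give no code), the pandas sketch below rolls hypothetical daily sales up to a coarser branch-per-quarter level of the cube; the column names and values are invented.

    import pandas as pd

    # Hypothetical detailed data: one row per (date, branch) with a sales amount.
    sales = pd.DataFrame({
        "date": pd.to_datetime(["2009-01-15", "2009-02-10", "2009-04-03", "2009-05-20"]),
        "branch": ["A", "A", "B", "B"],
        "amount": [400, 350, 500, 620],
    })

    # Aggregate to a higher cube level: total sales per branch per quarter.
    quarterly = (
        sales.assign(quarter=sales["date"].dt.to_period("Q"))
             .groupby(["branch", "quarter"], as_index=False)["amount"]
             .sum()
    )
    print(quarterly)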

  • Slide 15/24

    Attribute subset selection

    Dimensionality reduction via feature selection (i.e., attribute subset selection):

    Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features

    This reduces the number of patterns in the mining results, making them easier to understand

  • Slide 16/24

    Attribute subset selection

    Dimensionality reduction: heuristic methods (due to the exponential number of choices):

    Step-wise forward selection (see the sketch below)

    Step-wise backward elimination

    Combining forward selection and backward elimination

    Decision-tree induction
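
    A rough sketch of step-wise forward selection (my own illustration, not the authors' code): starting from the empty set, the attribute that most improves a caller-supplied evaluate() score is added at each step until no attribute helps. The evaluate function is a placeholder for a real model evaluation such as cross-validated accuracy.

    def forward_select(attributes, evaluate):
        """Greedy step-wise forward selection.

        attributes: list of candidate attribute names.
        evaluate:   callable scoring a subset of attributes (higher is better);
                    a placeholder for whatever quality measure is used.
        """
        selected = []
        best_score = evaluate(selected)
        improved = True
        while improved:
            improved = False
            for attr in (a for a in attributes if a not in selected):
                score = evaluate(selected + [attr])
                if score > best_score:
                    best_score, best_attr, improved = score, attr, True
            if improved:
                selected.append(best_attr)
        return selected

    Step-wise backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.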

  • Slide 17/24

    Attribute subset selection

    Dimensionality reduction example. Initial attribute set: {A1, A2, A3, A4, A5, A6}

    [Figure: decision-tree induction splits first on A4, then on A1 and A6, ending in Class1/Class2 leaves]

    > Reduced attribute set: {A1, A4, A6}

  • Slide 18/24

    Numerosity reduction

    Reduce data volume by choosing alternative, smaller forms of data representation

    Major families: histograms, clustering, sampling

  • Slide 19/24

    Data Reduction Method: Histograms

    [Figure: example histogram of prices from 10,000 to 100,000, with bucket counts between 0 and 40]

  • Slide 20/24

    Data Reduction Method: Histograms

    Divide the data into buckets and store the average (or sum) for each bucket

    Partitioning rules:

    Equal-width: equal bucket range

    Equal-frequency (or equal-depth): roughly equal number of values per bucket

    V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that each bucket represents)

    MaxDiff: set bucket boundaries between the pairs of adjacent values having the β-1 largest differences
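
    To make the two simplest rules concrete, here is a small numpy sketch of equal-width and equal-frequency bucketing (illustrative only; the price values are invented):

    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                       12, 14, 14, 15, 15, 18, 21, 25, 28, 30])

    # Equal-width: four buckets covering ranges of equal size.
    counts, edges = np.histogram(prices, bins=4)
    print("equal-width edges:", edges, "counts:", counts)

    # Equal-frequency (equal-depth): four buckets with roughly the same
    # number of values in each.
    depth_buckets = np.array_split(np.sort(prices), 4)
    print("equal-depth buckets:", [b.tolist() for b in depth_buckets])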

  • Slide 21/24

    Data Reduction Method: Clustering

    Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

    There are many choices of clustering definitions and clustering algorithms

    Cluster analysis will be studied in depth in Chapter 7
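
    A sketch of clustering-based reduction, assuming scikit-learn is available: the full data set is replaced by one (centroid, approximate diameter) summary per cluster.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))        # stand-in for a large data set

    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)

    # Reduced representation: centroid plus a rough diameter per cluster
    # (twice the largest member-to-centroid distance).
    summary = []
    for k, center in enumerate(kmeans.cluster_centers_):
        members = data[kmeans.labels_ == k]
        diameter = 2 * np.linalg.norm(members - center, axis=1).max()
        summary.append((center, diameter))

    print(len(summary), "cluster summaries instead of", len(data), "tuples")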

  • Slide 22/24

    Data Reduction Method: Sampling

    Sampling: obtaining a small sample s to represent the whole data set N

    Simple random sample without replacement

    Simple random sample with replacement

    Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, a simple random sample of s clusters can be obtained, where s < M

    Stratified sample
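
    The four schemes sketched with numpy on a made-up set of tuple IDs; the cluster assignments and stratum labels are invented for the example.

    import numpy as np

    rng = np.random.default_rng(42)
    N = np.arange(1000)                                   # stand-in for the tuples in D
    strata = np.repeat(["low", "mid", "high"], [500, 300, 200])

    srswor = rng.choice(N, size=50, replace=False)        # simple random sample w/o replacement
    srswr = rng.choice(N, size=50, replace=True)          # simple random sample with replacement

    # Cluster sample: pick s of the M disjoint clusters and keep every tuple in them.
    M = 20
    clusters = N % M                                      # invented cluster assignment
    chosen = rng.choice(M, size=4, replace=False)
    cluster_sample = N[np.isin(clusters, chosen)]

    # Stratified sample: draw from each stratum in proportion to its size.
    stratified = np.concatenate([
        rng.choice(N[strata == s], size=int(0.05 * (strata == s).sum()), replace=False)
        for s in np.unique(strata)
    ])
    print(len(srswor), len(srswr), len(cluster_sample), len(stratified))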

  • Slide 23/24

    Sampling: with or without replacement

    [Figure: a simple random sample drawn from the raw data, with and without replacement]

  • Slide 24/24

    Sampling: Cluster or Stratified Sampling

    [Figure: raw data on the left; cluster/stratified sample on the right]